Credit Risk Prediction Model for Default Prevention

Final Project
Foundations of Data Science with R (STAT 359)

Author

Casey Arbelaez

Published

August 25, 2024

Introduction

Project Overview

This project builds a predictive model of credit risk with the goal of preventing default. The model is intended to help financial institutions better assess the risk associated with credit applications. This report covers the process from data loading and exploration through model building and evaluation.

Project Repository

You can find the code and data for this project in my GitHub repository: https://github.com/STAT359-2024SU/359-final-project-CaseyArbelaez.

Data Source

The dataset used for this analysis is sourced from Kaggle. It contains 307,511 credit applications described by attributes such as income, credit amount, and days employed, along with a binary TARGET column. The data is used to predict credit risk and to inform default prevention strategies.

Necessary Libraries

Before working with the data, we load the R packages used in the analysis. These include packages for data manipulation, model training, and evaluation.
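The package list itself is not shown in this report, so the following is only a sketch under assumptions: the names `dplyr`, `ggplot2`, and `caret` are illustrative stand-ins for data manipulation, plotting, and modeling packages, not the project's confirmed dependencies. The sketch attaches whichever of them are installed and reports the rest.

```r
# Assumed package list (hypothetical; the report does not name its packages).
required_pkgs <- c("dplyr", "ggplot2", "caret")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are available.
for (pkg in required_pkgs[is_installed]) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
```

Checking availability first keeps the report from failing halfway through rendering when a dependency is absent.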

Data Loading

Loading the Dataset

We start by loading the dataset into a data frame and printing a column-by-column summary. The summary shows 307,511 rows, and the mean of TARGET (0.08073) is the share of observations in the minority class.

   SK_ID_CURR         TARGET        NAME_CONTRACT_TYPE CODE_GENDER       
 Min.   :100002   Min.   :0.00000   Length:307511      Length:307511     
 1st Qu.:189146   1st Qu.:0.00000   Class :character   Class :character  
 Median :278202   Median :0.00000   Mode  :character   Mode  :character  
 Mean   :278181   Mean   :0.08073                                        
 3rd Qu.:367143   3rd Qu.:0.00000                                        
 Max.   :456255   Max.   :1.00000                                        
                                                                         
 FLAG_OWN_CAR       FLAG_OWN_REALTY     CNT_CHILDREN     AMT_INCOME_TOTAL   
 Length:307511      Length:307511      Min.   : 0.0000   Min.   :    25650  
 Class :character   Class :character   1st Qu.: 0.0000   1st Qu.:   112500  
 Mode  :character   Mode  :character   Median : 0.0000   Median :   147150  
                                       Mean   : 0.4171   Mean   :   168798  
                                       3rd Qu.: 1.0000   3rd Qu.:   202500  
                                       Max.   :19.0000   Max.   :117000000  
                                                                            
   AMT_CREDIT       AMT_ANNUITY     AMT_GOODS_PRICE   NAME_TYPE_SUITE   
 Min.   :  45000   Min.   :  1616   Min.   :  40500   Length:307511     
 1st Qu.: 270000   1st Qu.: 16524   1st Qu.: 238500   Class :character  
 Median : 513531   Median : 24903   Median : 450000   Mode  :character  
 Mean   : 599026   Mean   : 27109   Mean   : 538396                     
 3rd Qu.: 808650   3rd Qu.: 34596   3rd Qu.: 679500                     
 Max.   :4050000   Max.   :258026   Max.   :4050000                     
                   NA's   :12       NA's   :278                         
 NAME_INCOME_TYPE   NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE 
 Length:307511      Length:307511       Length:307511      Length:307511     
 Class :character   Class :character    Class :character   Class :character  
 Mode  :character   Mode  :character    Mode  :character   Mode  :character  
                                                                             
                                                                             
                                                                             
                                                                             
 REGION_POPULATION_RELATIVE   DAYS_BIRTH     DAYS_EMPLOYED    DAYS_REGISTRATION
 Min.   :0.00029            Min.   :-25229   Min.   :-17912   Min.   :-24672   
 1st Qu.:0.01001            1st Qu.:-19682   1st Qu.: -2760   1st Qu.: -7480   
 Median :0.01885            Median :-15750   Median : -1213   Median : -4504   
 Mean   :0.02087            Mean   :-16037   Mean   : 63815   Mean   : -4986   
 3rd Qu.:0.02866            3rd Qu.:-12413   3rd Qu.:  -289   3rd Qu.: -2010   
 Max.   :0.07251            Max.   : -7489   Max.   :365243   Max.   :     0   
                                                                               
 DAYS_ID_PUBLISH  OWN_CAR_AGE       FLAG_MOBIL FLAG_EMP_PHONE  
 Min.   :-7197   Min.   : 0.00    Min.   :0    Min.   :0.0000  
 1st Qu.:-4299   1st Qu.: 5.00    1st Qu.:1    1st Qu.:1.0000  
 Median :-3254   Median : 9.00    Median :1    Median :1.0000  
 Mean   :-2994   Mean   :12.06    Mean   :1    Mean   :0.8199  
 3rd Qu.:-1720   3rd Qu.:15.00    3rd Qu.:1    3rd Qu.:1.0000  
 Max.   :    0   Max.   :91.00    Max.   :1    Max.   :1.0000  
                 NA's   :202929                                
 FLAG_WORK_PHONE  FLAG_CONT_MOBILE   FLAG_PHONE       FLAG_EMAIL     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.0000   Median :1.0000   Median :0.0000   Median :0.00000  
 Mean   :0.1994   Mean   :0.9981   Mean   :0.2811   Mean   :0.05672  
 3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
                                                                     
 OCCUPATION_TYPE    CNT_FAM_MEMBERS  REGION_RATING_CLIENT
 Length:307511      Min.   : 1.000   Min.   :1.000       
 Class :character   1st Qu.: 2.000   1st Qu.:2.000       
 Mode  :character   Median : 2.000   Median :2.000       
                    Mean   : 2.153   Mean   :2.052       
                    3rd Qu.: 3.000   3rd Qu.:2.000       
                    Max.   :20.000   Max.   :3.000       
                    NA's   :2                            
 REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START
 Min.   :1.000               Length:307511              Min.   : 0.00          
 1st Qu.:2.000               Class :character           1st Qu.:10.00          
 Median :2.000               Mode  :character           Median :12.00          
 Mean   :2.032                                          Mean   :12.06          
 3rd Qu.:2.000                                          3rd Qu.:14.00          
 Max.   :3.000                                          Max.   :23.00          
                                                                               
 REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION
 Min.   :0.00000            Min.   :0.00000           
 1st Qu.:0.00000            1st Qu.:0.00000           
 Median :0.00000            Median :0.00000           
 Mean   :0.01514            Mean   :0.05077           
 3rd Qu.:0.00000            3rd Qu.:0.00000           
 Max.   :1.00000            Max.   :1.00000           
                                                      
 LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY
 Min.   :0.00000             Min.   :0.00000        Min.   :0.0000        
 1st Qu.:0.00000             1st Qu.:0.00000        1st Qu.:0.0000        
 Median :0.00000             Median :0.00000        Median :0.0000        
 Mean   :0.04066             Mean   :0.07817        Mean   :0.2305        
 3rd Qu.:0.00000             3rd Qu.:0.00000        3rd Qu.:0.0000        
 Max.   :1.00000             Max.   :1.00000        Max.   :1.0000        
                                                                          
 LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE   EXT_SOURCE_1     EXT_SOURCE_2   
 Min.   :0.0000          Length:307511      Min.   :0.01     Min.   :0.0000  
 1st Qu.:0.0000          Class :character   1st Qu.:0.33     1st Qu.:0.3925  
 Median :0.0000          Mode  :character   Median :0.51     Median :0.5660  
 Mean   :0.1796                             Mean   :0.50     Mean   :0.5144  
 3rd Qu.:0.0000                             3rd Qu.:0.68     3rd Qu.:0.6636  
 Max.   :1.0000                             Max.   :0.96     Max.   :0.8550  
                                            NA's   :173378   NA's   :660     
  EXT_SOURCE_3   APARTMENTS_AVG   BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG
 Min.   :0.00    Min.   :0.00     Min.   :0.00     Min.   :0.00               
 1st Qu.:0.37    1st Qu.:0.06     1st Qu.:0.04     1st Qu.:0.98               
 Median :0.54    Median :0.09     Median :0.08     Median :0.98               
 Mean   :0.51    Mean   :0.12     Mean   :0.09     Mean   :0.98               
 3rd Qu.:0.67    3rd Qu.:0.15     3rd Qu.:0.11     3rd Qu.:0.99               
 Max.   :0.90    Max.   :1.00     Max.   :1.00     Max.   :1.00               
 NA's   :60965   NA's   :156061   NA's   :179943   NA's   :150007             
 YEARS_BUILD_AVG  COMMONAREA_AVG   ELEVATORS_AVG    ENTRANCES_AVG   
 Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
 1st Qu.:0.69     1st Qu.:0.01     1st Qu.:0.00     1st Qu.:0.07    
 Median :0.76     Median :0.02     Median :0.00     Median :0.14    
 Mean   :0.75     Mean   :0.04     Mean   :0.08     Mean   :0.15    
 3rd Qu.:0.82     3rd Qu.:0.05     3rd Qu.:0.12     3rd Qu.:0.21    
 Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
 NA's   :204488   NA's   :214865   NA's   :163891   NA's   :154828  
 FLOORSMAX_AVG    FLOORSMIN_AVG     LANDAREA_AVG    LIVINGAPARTMENTS_AVG
 Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00        
 1st Qu.:0.17     1st Qu.:0.08     1st Qu.:0.02     1st Qu.:0.05        
 Median :0.17     Median :0.21     Median :0.05     Median :0.08        
 Mean   :0.23     Mean   :0.23     Mean   :0.07     Mean   :0.10        
 3rd Qu.:0.33     3rd Qu.:0.38     3rd Qu.:0.09     3rd Qu.:0.12        
 Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00        
 NA's   :153020   NA's   :208642   NA's   :182590   NA's   :210199      
 LIVINGAREA_AVG   NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE 
 Min.   :0.00     Min.   :0.00            Min.   :0.00      Min.   :0.00    
 1st Qu.:0.05     1st Qu.:0.00            1st Qu.:0.00      1st Qu.:0.05    
 Median :0.07     Median :0.00            Median :0.00      Median :0.08    
 Mean   :0.11     Mean   :0.01            Mean   :0.03      Mean   :0.11    
 3rd Qu.:0.13     3rd Qu.:0.00            3rd Qu.:0.03      3rd Qu.:0.14    
 Max.   :1.00     Max.   :1.00            Max.   :1.00      Max.   :1.00    
 NA's   :154350   NA's   :213514          NA's   :169682    NA's   :156061  
 BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE
 Min.   :0.00      Min.   :0.00                 Min.   :0.00    
 1st Qu.:0.04      1st Qu.:0.98                 1st Qu.:0.70    
 Median :0.07      Median :0.98                 Median :0.76    
 Mean   :0.09      Mean   :0.98                 Mean   :0.76    
 3rd Qu.:0.11      3rd Qu.:0.99                 3rd Qu.:0.82    
 Max.   :1.00      Max.   :1.00                 Max.   :1.00    
 NA's   :179943    NA's   :150007               NA's   :204488  
 COMMONAREA_MODE  ELEVATORS_MODE   ENTRANCES_MODE   FLOORSMAX_MODE  
 Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
 1st Qu.:0.01     1st Qu.:0.00     1st Qu.:0.07     1st Qu.:0.17    
 Median :0.02     Median :0.00     Median :0.14     Median :0.17    
 Mean   :0.04     Mean   :0.07     Mean   :0.15     Mean   :0.22    
 3rd Qu.:0.05     3rd Qu.:0.12     3rd Qu.:0.21     3rd Qu.:0.33    
 Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
 NA's   :214865   NA's   :163891   NA's   :154828   NA's   :153020  
 FLOORSMIN_MODE   LANDAREA_MODE    LIVINGAPARTMENTS_MODE LIVINGAREA_MODE 
 Min.   :0.00     Min.   :0.00     Min.   :0.00          Min.   :0.00    
 1st Qu.:0.08     1st Qu.:0.02     1st Qu.:0.05          1st Qu.:0.04    
 Median :0.21     Median :0.05     Median :0.08          Median :0.07    
 Mean   :0.23     Mean   :0.06     Mean   :0.11          Mean   :0.11    
 3rd Qu.:0.38     3rd Qu.:0.08     3rd Qu.:0.13          3rd Qu.:0.13    
 Max.   :1.00     Max.   :1.00     Max.   :1.00          Max.   :1.00    
 NA's   :208642   NA's   :182590   NA's   :210199        NA's   :154350  
 NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI  BASEMENTAREA_MEDI
 Min.   :0.00             Min.   :0.00       Min.   :0.00     Min.   :0.00     
 1st Qu.:0.00             1st Qu.:0.00       1st Qu.:0.06     1st Qu.:0.04     
 Median :0.00             Median :0.00       Median :0.09     Median :0.08     
 Mean   :0.01             Mean   :0.03       Mean   :0.12     Mean   :0.09     
 3rd Qu.:0.00             3rd Qu.:0.02       3rd Qu.:0.15     3rd Qu.:0.11     
 Max.   :1.00             Max.   :1.00       Max.   :1.00     Max.   :1.00     
 NA's   :213514           NA's   :169682     NA's   :156061   NA's   :179943   
 YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI 
 Min.   :0.00                 Min.   :0.00     Min.   :0.00    
 1st Qu.:0.98                 1st Qu.:0.69     1st Qu.:0.01    
 Median :0.98                 Median :0.76     Median :0.02    
 Mean   :0.98                 Mean   :0.76     Mean   :0.04    
 3rd Qu.:0.99                 3rd Qu.:0.83     3rd Qu.:0.05    
 Max.   :1.00                 Max.   :1.00     Max.   :1.00    
 NA's   :150007               NA's   :204488   NA's   :214865  
 ELEVATORS_MEDI   ENTRANCES_MEDI   FLOORSMAX_MEDI   FLOORSMIN_MEDI  
 Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
 1st Qu.:0.00     1st Qu.:0.07     1st Qu.:0.17     1st Qu.:0.08    
 Median :0.00     Median :0.14     Median :0.17     Median :0.21    
 Mean   :0.08     Mean   :0.15     Mean   :0.23     Mean   :0.23    
 3rd Qu.:0.12     3rd Qu.:0.21     3rd Qu.:0.33     3rd Qu.:0.38    
 Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
 NA's   :163891   NA's   :154828   NA's   :153020   NA's   :208642  
 LANDAREA_MEDI    LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI 
 Min.   :0.00     Min.   :0.00          Min.   :0.00    
 1st Qu.:0.02     1st Qu.:0.05          1st Qu.:0.05    
 Median :0.05     Median :0.08          Median :0.07    
 Mean   :0.07     Mean   :0.10          Mean   :0.11    
 3rd Qu.:0.09     3rd Qu.:0.12          3rd Qu.:0.13    
 Max.   :1.00     Max.   :1.00          Max.   :1.00    
 NA's   :182590   NA's   :210199        NA's   :154350  
 NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE
 Min.   :0.00             Min.   :0.00       Length:307511     
 1st Qu.:0.00             1st Qu.:0.00       Class :character  
 Median :0.00             Median :0.00       Mode  :character  
 Mean   :0.01             Mean   :0.03                         
 3rd Qu.:0.00             3rd Qu.:0.03                         
 Max.   :1.00             Max.   :1.00                         
 NA's   :213514           NA's   :169682                       
 HOUSETYPE_MODE     TOTALAREA_MODE   WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
 Length:307511      Min.   :0.00     Length:307511      Length:307511      
 Class :character   1st Qu.:0.04     Class :character   Class :character   
 Mode  :character   Median :0.07     Mode  :character   Mode  :character   
                    Mean   :0.10                                           
                    3rd Qu.:0.13                                           
                    Max.   :1.00                                           
                    NA's   :148431                                         
 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
 Min.   :  0.000          Min.   : 0.0000          Min.   :  0.000         
 1st Qu.:  0.000          1st Qu.: 0.0000          1st Qu.:  0.000         
 Median :  0.000          Median : 0.0000          Median :  0.000         
 Mean   :  1.422          Mean   : 0.1434          Mean   :  1.405         
 3rd Qu.:  2.000          3rd Qu.: 0.0000          3rd Qu.:  2.000         
 Max.   :348.000          Max.   :34.0000          Max.   :344.000         
 NA's   :1021             NA's   :1021             NA's   :1021            
 DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2   
 Min.   : 0.0             Min.   :-4292.0        Min.   :0.00e+00  
 1st Qu.: 0.0             1st Qu.:-1570.0        1st Qu.:0.00e+00  
 Median : 0.0             Median : -757.0        Median :0.00e+00  
 Mean   : 0.1             Mean   : -962.9        Mean   :4.23e-05  
 3rd Qu.: 0.0             3rd Qu.: -274.0        3rd Qu.:0.00e+00  
 Max.   :24.0             Max.   :    0.0        Max.   :1.00e+00  
 NA's   :1021             NA's   :1                                
 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4    FLAG_DOCUMENT_5   FLAG_DOCUMENT_6  
 Min.   :0.00    Min.   :0.00e+00   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00    1st Qu.:0.00e+00   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :1.00    Median :0.00e+00   Median :0.00000   Median :0.00000  
 Mean   :0.71    Mean   :8.13e-05   Mean   :0.01511   Mean   :0.08806  
 3rd Qu.:1.00    3rd Qu.:0.00e+00   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.00    Max.   :1.00e+00   Max.   :1.00000   Max.   :1.00000  
                                                                       
 FLAG_DOCUMENT_7     FLAG_DOCUMENT_8   FLAG_DOCUMENT_9    FLAG_DOCUMENT_10  
 Min.   :0.0000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00e+00  
 1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00e+00  
 Median :0.0000000   Median :0.00000   Median :0.000000   Median :0.00e+00  
 Mean   :0.0001919   Mean   :0.08138   Mean   :0.003896   Mean   :2.28e-05  
 3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00e+00  
 Max.   :1.0000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00e+00  
                                                                            
 FLAG_DOCUMENT_11   FLAG_DOCUMENT_12  FLAG_DOCUMENT_13   FLAG_DOCUMENT_14  
 Min.   :0.000000   Min.   :0.0e+00   Min.   :0.000000   Min.   :0.000000  
 1st Qu.:0.000000   1st Qu.:0.0e+00   1st Qu.:0.000000   1st Qu.:0.000000  
 Median :0.000000   Median :0.0e+00   Median :0.000000   Median :0.000000  
 Mean   :0.003912   Mean   :6.5e-06   Mean   :0.003525   Mean   :0.002936  
 3rd Qu.:0.000000   3rd Qu.:0.0e+00   3rd Qu.:0.000000   3rd Qu.:0.000000  
 Max.   :1.000000   Max.   :1.0e+00   Max.   :1.000000   Max.   :1.000000  
                                                                           
 FLAG_DOCUMENT_15  FLAG_DOCUMENT_16   FLAG_DOCUMENT_17    FLAG_DOCUMENT_18 
 Min.   :0.00000   Min.   :0.000000   Min.   :0.0000000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.00000  
 Median :0.00000   Median :0.000000   Median :0.0000000   Median :0.00000  
 Mean   :0.00121   Mean   :0.009928   Mean   :0.0002667   Mean   :0.00813  
 3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.0000000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.000000   Max.   :1.0000000   Max.   :1.00000  
                                                                           
 FLAG_DOCUMENT_19    FLAG_DOCUMENT_20    FLAG_DOCUMENT_21   
 Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000  
 1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.0000000  
 Median :0.0000000   Median :0.0000000   Median :0.0000000  
 Mean   :0.0005951   Mean   :0.0005073   Mean   :0.0003349  
 3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.0000000  
 Max.   :1.0000000   Max.   :1.0000000   Max.   :1.0000000  
                                                            
 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY
 Min.   :0.00               Min.   :0.00             
 1st Qu.:0.00               1st Qu.:0.00             
 Median :0.00               Median :0.00             
 Mean   :0.01               Mean   :0.01             
 3rd Qu.:0.00               3rd Qu.:0.00             
 Max.   :4.00               Max.   :9.00             
 NA's   :41519              NA's   :41519            
 AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT
 Min.   :0.00               Min.   : 0.00             Min.   :  0.00           
 1st Qu.:0.00               1st Qu.: 0.00             1st Qu.:  0.00           
 Median :0.00               Median : 0.00             Median :  0.00           
 Mean   :0.03               Mean   : 0.27             Mean   :  0.27           
 3rd Qu.:0.00               3rd Qu.: 0.00             3rd Qu.:  0.00           
 Max.   :8.00               Max.   :27.00             Max.   :261.00           
 NA's   :41519              NA's   :41519             NA's   :41519            
 AMT_REQ_CREDIT_BUREAU_YEAR
 Min.   : 0.0              
 1st Qu.: 0.0              
 Median : 1.0              
 Mean   : 1.9              
 3rd Qu.: 3.0              
 Max.   :25.0              
 NA's   :41519             
Percentage of observations in the minority class: 8.07 %
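Output of the kind shown above could be produced along the following lines. This is a sketch, not the project's actual code: the file path is not given in the report, so a small toy CSV is written to a temporary file as a stand-in, and the names `toy_path` and `credit_data` are hypothetical. Treating blank strings as missing is also an assumption.

```r
# Toy stand-in for the Kaggle file so the sketch runs anywhere;
# in practice the path would point at the downloaded CSV.
toy_path <- tempfile(fileext = ".csv")
toy_rows <- data.frame(
  SK_ID_CURR = 100001:100010,
  TARGET = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
  NAME_CONTRACT_TYPE = rep(c("Cash loans", "Revolving loans"), 5),
  AMT_CREDIT = seq(50000, 500000, length.out = 10)
)
write.csv(toy_rows, toy_path, row.names = FALSE)

# Load the data; blank strings are read as NA (an assumption).
credit_data <- read.csv(toy_path, stringsAsFactors = FALSE,
                        na.strings = c("", "NA"))

# Column-by-column overview: quartiles for numeric columns,
# length/class/mode for character columns.
print(summary(credit_data))

# Share of rows in the minority class (TARGET == 1).
minority_pct <- 100 * mean(credit_data$TARGET == 1)
cat("Percentage of observations in the minority class:",
    round(minority_pct, 2), "%\n")
```

Because TARGET is coded 0/1, its mean is the proportion of rows with TARGET equal to 1, which is why the summary's mean of 0.08073 and the reported 8.07 % agree.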

Visualizing Class Imbalance on Original Dataset

To further understand the dataset, we visualize the distribution of the TARGET variable to illustrate the class imbalance. The plot shows the proportion of positive and negative cases in the original dataset, where only about 8% of observations belong to the minority class.
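A class-distribution plot of this kind can be sketched with base R graphics. The report's own plotting code is not shown, so this is an illustration only: the `target` vector is toy data built to roughly match the reported imbalance, and the choice of `barplot()` is an assumption.

```r
# Toy TARGET vector with roughly the reported imbalance (assumed data).
target <- c(rep(0, 92), rep(1, 8))

# Counts per class and their percentage shares.
class_counts <- table(TARGET = target)
class_share <- round(100 * prop.table(class_counts), 1)

# Bar chart of class counts; extra headroom on the y axis leaves
# space for the percentage labels above each bar.
bar_mid <- barplot(
  class_counts,
  main = "Distribution of TARGET",
  xlab = "TARGET",
  ylab = "Number of observations",
  ylim = c(0, max(class_counts) * 1.15)
)
text(bar_mid, as.vector(class_counts),
     labels = paste0(as.vector(class_share), "%"), pos = 3)
```

Plotting counts while labelling percentages shows both the absolute size of each class and how lopsided the split is.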

EDA

To start our EDA (Exploratory Data Analysis), let’s perform the following steps:

  1. Check for Missing Values
  2. Summary Statistics for Numerical Features
  3. Distribution Plots for Key Numerical Variables
  4. Categorical Variable Analysis

1. Check for Missing Values

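Understanding the amount of missing data in each column helps us plan our data cleaning and preprocessing steps. The table below lists every column with at least one missing value, sorted from most to least missing.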

                                                 Variable Missing
COMMONAREA_AVG                             COMMONAREA_AVG  214865
COMMONAREA_MODE                           COMMONAREA_MODE  214865
COMMONAREA_MEDI                           COMMONAREA_MEDI  214865
NONLIVINGAPARTMENTS_AVG           NONLIVINGAPARTMENTS_AVG  213514
NONLIVINGAPARTMENTS_MODE         NONLIVINGAPARTMENTS_MODE  213514
NONLIVINGAPARTMENTS_MEDI         NONLIVINGAPARTMENTS_MEDI  213514
FONDKAPREMONT_MODE                     FONDKAPREMONT_MODE  210295
LIVINGAPARTMENTS_AVG                 LIVINGAPARTMENTS_AVG  210199
LIVINGAPARTMENTS_MODE               LIVINGAPARTMENTS_MODE  210199
LIVINGAPARTMENTS_MEDI               LIVINGAPARTMENTS_MEDI  210199
FLOORSMIN_AVG                               FLOORSMIN_AVG  208642
FLOORSMIN_MODE                             FLOORSMIN_MODE  208642
FLOORSMIN_MEDI                             FLOORSMIN_MEDI  208642
YEARS_BUILD_AVG                           YEARS_BUILD_AVG  204488
YEARS_BUILD_MODE                         YEARS_BUILD_MODE  204488
YEARS_BUILD_MEDI                         YEARS_BUILD_MEDI  204488
OWN_CAR_AGE                                   OWN_CAR_AGE  202929
LANDAREA_AVG                                 LANDAREA_AVG  182590
LANDAREA_MODE                               LANDAREA_MODE  182590
LANDAREA_MEDI                               LANDAREA_MEDI  182590
BASEMENTAREA_AVG                         BASEMENTAREA_AVG  179943
BASEMENTAREA_MODE                       BASEMENTAREA_MODE  179943
BASEMENTAREA_MEDI                       BASEMENTAREA_MEDI  179943
EXT_SOURCE_1                                 EXT_SOURCE_1  173378
NONLIVINGAREA_AVG                       NONLIVINGAREA_AVG  169682
NONLIVINGAREA_MODE                     NONLIVINGAREA_MODE  169682
NONLIVINGAREA_MEDI                     NONLIVINGAREA_MEDI  169682
ELEVATORS_AVG                               ELEVATORS_AVG  163891
ELEVATORS_MODE                             ELEVATORS_MODE  163891
ELEVATORS_MEDI                             ELEVATORS_MEDI  163891
WALLSMATERIAL_MODE                     WALLSMATERIAL_MODE  156341
APARTMENTS_AVG                             APARTMENTS_AVG  156061
APARTMENTS_MODE                           APARTMENTS_MODE  156061
APARTMENTS_MEDI                           APARTMENTS_MEDI  156061
ENTRANCES_AVG                               ENTRANCES_AVG  154828
ENTRANCES_MODE                             ENTRANCES_MODE  154828
ENTRANCES_MEDI                             ENTRANCES_MEDI  154828
LIVINGAREA_AVG                             LIVINGAREA_AVG  154350
LIVINGAREA_MODE                           LIVINGAREA_MODE  154350
LIVINGAREA_MEDI                           LIVINGAREA_MEDI  154350
HOUSETYPE_MODE                             HOUSETYPE_MODE  154297
FLOORSMAX_AVG                               FLOORSMAX_AVG  153020
FLOORSMAX_MODE                             FLOORSMAX_MODE  153020
FLOORSMAX_MEDI                             FLOORSMAX_MEDI  153020
YEARS_BEGINEXPLUATATION_AVG   YEARS_BEGINEXPLUATATION_AVG  150007
YEARS_BEGINEXPLUATATION_MODE YEARS_BEGINEXPLUATATION_MODE  150007
YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MEDI  150007
TOTALAREA_MODE                             TOTALAREA_MODE  148431
EMERGENCYSTATE_MODE                   EMERGENCYSTATE_MODE  145755
OCCUPATION_TYPE                           OCCUPATION_TYPE   96391
EXT_SOURCE_3                                 EXT_SOURCE_3   60965
AMT_REQ_CREDIT_BUREAU_HOUR     AMT_REQ_CREDIT_BUREAU_HOUR   41519
AMT_REQ_CREDIT_BUREAU_DAY       AMT_REQ_CREDIT_BUREAU_DAY   41519
AMT_REQ_CREDIT_BUREAU_WEEK     AMT_REQ_CREDIT_BUREAU_WEEK   41519
AMT_REQ_CREDIT_BUREAU_MON       AMT_REQ_CREDIT_BUREAU_MON   41519
AMT_REQ_CREDIT_BUREAU_QRT       AMT_REQ_CREDIT_BUREAU_QRT   41519
AMT_REQ_CREDIT_BUREAU_YEAR     AMT_REQ_CREDIT_BUREAU_YEAR   41519
NAME_TYPE_SUITE                           NAME_TYPE_SUITE    1292
OBS_30_CNT_SOCIAL_CIRCLE         OBS_30_CNT_SOCIAL_CIRCLE    1021
DEF_30_CNT_SOCIAL_CIRCLE         DEF_30_CNT_SOCIAL_CIRCLE    1021
OBS_60_CNT_SOCIAL_CIRCLE         OBS_60_CNT_SOCIAL_CIRCLE    1021
DEF_60_CNT_SOCIAL_CIRCLE         DEF_60_CNT_SOCIAL_CIRCLE    1021
EXT_SOURCE_2                                 EXT_SOURCE_2     660
AMT_GOODS_PRICE                           AMT_GOODS_PRICE     278
AMT_ANNUITY                                   AMT_ANNUITY      12
CNT_FAM_MEMBERS                           CNT_FAM_MEMBERS       2
DAYS_LAST_PHONE_CHANGE             DAYS_LAST_PHONE_CHANGE       1
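A table in this layout can be built with a few lines of base R. The sketch below uses a toy data frame (`toy` is a hypothetical name, and the values are made up) and is not necessarily how the table above was produced.

```r
# Toy data frame with a few missing values (illustrative values only).
toy <- data.frame(
  AMT_ANNUITY = c(100, NA, 300, 400),
  OWN_CAR_AGE = c(NA, NA, 5, NA),
  TARGET = c(0, 1, 0, 0)
)

# Count NA entries per column; the result is a named numeric vector.
missing_counts <- colSums(is.na(toy))

# One row per column; the vector's names also become the row names.
missing_table <- data.frame(
  Variable = names(missing_counts),
  Missing = missing_counts,
  stringsAsFactors = FALSE
)

# Keep only columns with missing values, sorted from most to least missing.
missing_table <- missing_table[missing_table$Missing > 0, ]
missing_table <- missing_table[order(-missing_table$Missing), ]
print(missing_table)
```

Dividing `Missing` by `nrow(toy)` would give the missing share per column, which is often easier to act on than raw counts when deciding which columns to drop or impute.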

2. Summary Statistics for Numerical Features

We’ll explore the summary statistics to get a sense of the range, central tendency, and spread of the numerical features.
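As a minimal sketch of this step, the numerical columns can be selected by type and passed to `summary()`. The data frame `toy` and its values are hypothetical; only the column names are borrowed from the dataset.

```r
# Toy data frame mixing numeric and character columns (illustrative values).
toy <- data.frame(
  TARGET = c(0, 1, 0, 0),
  AMT_CREDIT = c(45000, 270000, 513531, NA),
  CODE_GENDER = c("F", "M", "F", "F"),
  stringsAsFactors = FALSE
)

# Logical flag per column: TRUE where the column is numeric.
numeric_cols <- vapply(toy, is.numeric, logical(1))

# Summarise only the numeric columns; drop = FALSE keeps a data frame
# even if a single column is selected.
numeric_summary <- summary(toy[, numeric_cols, drop = FALSE])
print(numeric_summary)
```

For each numeric column, `summary()` reports the minimum, quartiles, mean, and maximum, plus an NA count when values are missing.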

   SK_ID_CURR         TARGET         CNT_CHILDREN     AMT_INCOME_TOTAL   
 Min.   :100002   Min.   :0.00000   Min.   : 0.0000   Min.   :    25650  
 1st Qu.:189146   1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.:   112500  
 Median :278202   Median :0.00000   Median : 0.0000   Median :   147150  
 Mean   :278181   Mean   :0.08073   Mean   : 0.4171   Mean   :   168798  
 3rd Qu.:367143   3rd Qu.:0.00000   3rd Qu.: 1.0000   3rd Qu.:   202500  
 Max.   :456255   Max.   :1.00000   Max.   :19.0000   Max.   :117000000  
                                                                         
   AMT_CREDIT       AMT_ANNUITY     AMT_GOODS_PRICE  
 Min.   :  45000   Min.   :  1616   Min.   :  40500  
 1st Qu.: 270000   1st Qu.: 16524   1st Qu.: 238500  
 Median : 513531   Median : 24903   Median : 450000  
 Mean   : 599026   Mean   : 27109   Mean   : 538396  
 3rd Qu.: 808650   3rd Qu.: 34596   3rd Qu.: 679500  
 Max.   :4050000   Max.   :258026   Max.   :4050000  
                   NA's   :12       NA's   :278      
 REGION_POPULATION_RELATIVE   DAYS_BIRTH     DAYS_EMPLOYED    DAYS_REGISTRATION
 Min.   :0.00029            Min.   :-25229   Min.   :-17912   Min.   :-24672   
 1st Qu.:0.01001            1st Qu.:-19682   1st Qu.: -2760   1st Qu.: -7480   
 Median :0.01885            Median :-15750   Median : -1213   Median : -4504   
 Mean   :0.02087            Mean   :-16037   Mean   : 63815   Mean   : -4986   
 3rd Qu.:0.02866            3rd Qu.:-12413   3rd Qu.:  -289   3rd Qu.: -2010   
 Max.   :0.07251            Max.   : -7489   Max.   :365243   Max.   :     0   
                                                                               
 DAYS_ID_PUBLISH  OWN_CAR_AGE       FLAG_MOBIL FLAG_EMP_PHONE  
 Min.   :-7197   Min.   : 0.00    Min.   :0    Min.   :0.0000  
 1st Qu.:-4299   1st Qu.: 5.00    1st Qu.:1    1st Qu.:1.0000  
 Median :-3254   Median : 9.00    Median :1    Median :1.0000  
 Mean   :-2994   Mean   :12.06    Mean   :1    Mean   :0.8199  
 3rd Qu.:-1720   3rd Qu.:15.00    3rd Qu.:1    3rd Qu.:1.0000  
 Max.   :    0   Max.   :91.00    Max.   :1    Max.   :1.0000  
                 NA's   :202929                                
 FLAG_WORK_PHONE  FLAG_CONT_MOBILE   FLAG_PHONE       FLAG_EMAIL     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.0000   Median :1.0000   Median :0.0000   Median :0.00000  
 Mean   :0.1994   Mean   :0.9981   Mean   :0.2811   Mean   :0.05672  
 3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
                                                                     
 CNT_FAM_MEMBERS  REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY
 Min.   : 1.000   Min.   :1.000        Min.   :1.000              
 1st Qu.: 2.000   1st Qu.:2.000        1st Qu.:2.000              
 Median : 2.000   Median :2.000        Median :2.000              
 Mean   : 2.153   Mean   :2.052        Mean   :2.032              
 3rd Qu.: 3.000   3rd Qu.:2.000        3rd Qu.:2.000              
 Max.   :20.000   Max.   :3.000        Max.   :3.000              
 NA's   :2                                                        
 HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION
 Min.   : 0.00           Min.   :0.00000            Min.   :0.00000           
 1st Qu.:10.00           1st Qu.:0.00000            1st Qu.:0.00000           
 Median :12.00           Median :0.00000            Median :0.00000           
 Mean   :12.06           Mean   :0.01514            Mean   :0.05077           
 3rd Qu.:14.00           3rd Qu.:0.00000            3rd Qu.:0.00000           
 Max.   :23.00           Max.   :1.00000            Max.   :1.00000           
                                                                              
 LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY
 Min.   :0.00000             Min.   :0.00000        Min.   :0.0000        
 1st Qu.:0.00000             1st Qu.:0.00000        1st Qu.:0.0000        
 Median :0.00000             Median :0.00000        Median :0.0000        
 Mean   :0.04066             Mean   :0.07817        Mean   :0.2305        
 3rd Qu.:0.00000             3rd Qu.:0.00000        3rd Qu.:0.0000        
 Max.   :1.00000             Max.   :1.00000        Max.   :1.0000        
                                                                          
 LIVE_CITY_NOT_WORK_CITY  EXT_SOURCE_1     EXT_SOURCE_2     EXT_SOURCE_3  
 Min.   :0.0000          Min.   :0.01     Min.   :0.0000   Min.   :0.00   
 1st Qu.:0.0000          1st Qu.:0.33     1st Qu.:0.3925   1st Qu.:0.37   
 Median :0.0000          Median :0.51     Median :0.5660   Median :0.54   
 Mean   :0.1796          Mean   :0.50     Mean   :0.5144   Mean   :0.51   
 3rd Qu.:0.0000          3rd Qu.:0.68     3rd Qu.:0.6636   3rd Qu.:0.67   
 Max.   :1.0000          Max.   :0.96     Max.   :0.8550   Max.   :0.90   
                         NA's   :173378   NA's   :660      NA's   :60965  
 APARTMENTS_AVG   BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG 
 Min.   :0.00     Min.   :0.00     Min.   :0.00                Min.   :0.00    
 1st Qu.:0.06     1st Qu.:0.04     1st Qu.:0.98                1st Qu.:0.69    
 Median :0.09     Median :0.08     Median :0.98                Median :0.76    
 Mean   :0.12     Mean   :0.09     Mean   :0.98                Mean   :0.75    
 3rd Qu.:0.15     3rd Qu.:0.11     3rd Qu.:0.99                3rd Qu.:0.82    
 Max.   :1.00     Max.   :1.00     Max.   :1.00                Max.   :1.00    
 NA's   :156061   NA's   :179943   NA's   :150007              NA's   :204488  
 COMMONAREA_AVG   ELEVATORS_AVG    ENTRANCES_AVG    FLOORSMAX_AVG   
 Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
 1st Qu.:0.01     1st Qu.:0.00     1st Qu.:0.07     1st Qu.:0.17    
 Median :0.02     Median :0.00     Median :0.14     Median :0.17    
 Mean   :0.04     Mean   :0.08     Mean   :0.15     Mean   :0.23    
 3rd Qu.:0.05     3rd Qu.:0.12     3rd Qu.:0.21     3rd Qu.:0.33    
 Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
 NA's   :214865   NA's   :163891   NA's   :154828   NA's   :153020  
 FLOORSMIN_AVG     LANDAREA_AVG    LIVINGAPARTMENTS_AVG LIVINGAREA_AVG  
 Min.   :0.00     Min.   :0.00     Min.   :0.00         Min.   :0.00    
 1st Qu.:0.08     1st Qu.:0.02     1st Qu.:0.05         1st Qu.:0.05    
 Median :0.21     Median :0.05     Median :0.08         Median :0.07    
 Mean   :0.23     Mean   :0.07     Mean   :0.10         Mean   :0.11    
 3rd Qu.:0.38     3rd Qu.:0.09     3rd Qu.:0.12         3rd Qu.:0.13    
 Max.   :1.00     Max.   :1.00     Max.   :1.00         Max.   :1.00    
 NA's   :208642   NA's   :182590   NA's   :210199       NA's   :154350  
 NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE  BASEMENTAREA_MODE
 Min.   :0.00            Min.   :0.00      Min.   :0.00     Min.   :0.00     
 1st Qu.:0.00            1st Qu.:0.00      1st Qu.:0.05     1st Qu.:0.04     
 Median :0.00            Median :0.00      Median :0.08     Median :0.07     
 Mean   :0.01            Mean   :0.03      Mean   :0.11     Mean   :0.09     
 3rd Qu.:0.00            3rd Qu.:0.03      3rd Qu.:0.14     3rd Qu.:0.11     
 Max.   :1.00            Max.   :1.00      Max.   :1.00     Max.   :1.00     
 NA's   :213514          NA's   :169682    NA's   :156061   NA's   :179943   
 YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE 
 Min.   :0.00                 Min.   :0.00     Min.   :0.00    
 1st Qu.:0.98                 1st Qu.:0.70     1st Qu.:0.01    
 Median :0.98                 Median :0.76     Median :0.02    
 Mean   :0.98                 Mean   :0.76     Mean   :0.04    
 3rd Qu.:0.99                 3rd Qu.:0.82     3rd Qu.:0.05    
 Max.   :1.00                 Max.   :1.00     Max.   :1.00    
 NA's   :150007               NA's   :204488   NA's   :214865  
 ELEVATORS_MODE   ENTRANCES_MODE   FLOORSMAX_MODE   FLOORSMIN_MODE  
 Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
 1st Qu.:0.00     1st Qu.:0.07     1st Qu.:0.17     1st Qu.:0.08    
 Median :0.00     Median :0.14     Median :0.17     Median :0.21    
 Mean   :0.07     Mean   :0.15     Mean   :0.22     Mean   :0.23    
 3rd Qu.:0.12     3rd Qu.:0.21     3rd Qu.:0.33     3rd Qu.:0.38    
 Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
 NA's   :163891   NA's   :154828   NA's   :153020   NA's   :208642  
 LANDAREA_MODE    LIVINGAPARTMENTS_MODE LIVINGAREA_MODE 
 Min.   :0.00     Min.   :0.00          Min.   :0.00    
 1st Qu.:0.02     1st Qu.:0.05          1st Qu.:0.04    
 Median :0.05     Median :0.08          Median :0.07    
 Mean   :0.06     Mean   :0.11          Mean   :0.11    
 3rd Qu.:0.08     3rd Qu.:0.13          3rd Qu.:0.13    
 Max.   :1.00     Max.   :1.00          Max.   :1.00    
 NA's   :182590   NA's   :210199        NA's   :154350  
 NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI  BASEMENTAREA_MEDI
 Min.   :0.00             Min.   :0.00       Min.   :0.00     Min.   :0.00     
 1st Qu.:0.00             1st Qu.:0.00       1st Qu.:0.06     1st Qu.:0.04     
 Median :0.00             Median :0.00       Median :0.09     Median :0.08     
 Mean   :0.01             Mean   :0.03       Mean   :0.12     Mean   :0.09     
 3rd Qu.:0.00             3rd Qu.:0.02       3rd Qu.:0.15     3rd Qu.:0.11     
 Max.   :1.00             Max.   :1.00       Max.   :1.00     Max.   :1.00     
 NA's   :213514           NA's   :169682     NA's   :156061   NA's   :179943   
 YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI 
 Min.   :0.00                 Min.   :0.00     Min.   :0.00    
 1st Qu.:0.98                 1st Qu.:0.69     1st Qu.:0.01    
 Median :0.98                 Median :0.76     Median :0.02    
 Mean   :0.98                 Mean   :0.76     Mean   :0.04    
 3rd Qu.:0.99                 3rd Qu.:0.83     3rd Qu.:0.05    
 Max.   :1.00                 Max.   :1.00     Max.   :1.00    
 NA's   :150007               NA's   :204488   NA's   :214865  
 ELEVATORS_MEDI   ENTRANCES_MEDI   FLOORSMAX_MEDI   FLOORSMIN_MEDI  
 Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
 1st Qu.:0.00     1st Qu.:0.07     1st Qu.:0.17     1st Qu.:0.08    
 Median :0.00     Median :0.14     Median :0.17     Median :0.21    
 Mean   :0.08     Mean   :0.15     Mean   :0.23     Mean   :0.23    
 3rd Qu.:0.12     3rd Qu.:0.21     3rd Qu.:0.33     3rd Qu.:0.38    
 Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
 NA's   :163891   NA's   :154828   NA's   :153020   NA's   :208642  
 LANDAREA_MEDI    LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI 
 Min.   :0.00     Min.   :0.00          Min.   :0.00    
 1st Qu.:0.02     1st Qu.:0.05          1st Qu.:0.05    
 Median :0.05     Median :0.08          Median :0.07    
 Mean   :0.07     Mean   :0.10          Mean   :0.11    
 3rd Qu.:0.09     3rd Qu.:0.12          3rd Qu.:0.13    
 Max.   :1.00     Max.   :1.00          Max.   :1.00    
 NA's   :182590   NA's   :210199        NA's   :154350  
 NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE  
 Min.   :0.00             Min.   :0.00       Min.   :0.00    
 1st Qu.:0.00             1st Qu.:0.00       1st Qu.:0.04    
 Median :0.00             Median :0.00       Median :0.07    
 Mean   :0.01             Mean   :0.03       Mean   :0.10    
 3rd Qu.:0.00             3rd Qu.:0.03       3rd Qu.:0.13    
 Max.   :1.00             Max.   :1.00       Max.   :1.00    
 NA's   :213514           NA's   :169682     NA's   :148431  
 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
 Min.   :  0.000          Min.   : 0.0000          Min.   :  0.000         
 1st Qu.:  0.000          1st Qu.: 0.0000          1st Qu.:  0.000         
 Median :  0.000          Median : 0.0000          Median :  0.000         
 Mean   :  1.422          Mean   : 0.1434          Mean   :  1.405         
 3rd Qu.:  2.000          3rd Qu.: 0.0000          3rd Qu.:  2.000         
 Max.   :348.000          Max.   :34.0000          Max.   :344.000         
 NA's   :1021             NA's   :1021             NA's   :1021            
 DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2   
 Min.   : 0.0             Min.   :-4292.0        Min.   :0.00e+00  
 1st Qu.: 0.0             1st Qu.:-1570.0        1st Qu.:0.00e+00  
 Median : 0.0             Median : -757.0        Median :0.00e+00  
 Mean   : 0.1             Mean   : -962.9        Mean   :4.23e-05  
 3rd Qu.: 0.0             3rd Qu.: -274.0        3rd Qu.:0.00e+00  
 Max.   :24.0             Max.   :    0.0        Max.   :1.00e+00  
 NA's   :1021             NA's   :1                                
 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4    FLAG_DOCUMENT_5   FLAG_DOCUMENT_6  
 Min.   :0.00    Min.   :0.00e+00   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00    1st Qu.:0.00e+00   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :1.00    Median :0.00e+00   Median :0.00000   Median :0.00000  
 Mean   :0.71    Mean   :8.13e-05   Mean   :0.01511   Mean   :0.08806  
 3rd Qu.:1.00    3rd Qu.:0.00e+00   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.00    Max.   :1.00e+00   Max.   :1.00000   Max.   :1.00000  
                                                                       
 FLAG_DOCUMENT_7     FLAG_DOCUMENT_8   FLAG_DOCUMENT_9    FLAG_DOCUMENT_10  
 Min.   :0.0000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00e+00  
 1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00e+00  
 Median :0.0000000   Median :0.00000   Median :0.000000   Median :0.00e+00  
 Mean   :0.0001919   Mean   :0.08138   Mean   :0.003896   Mean   :2.28e-05  
 3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00e+00  
 Max.   :1.0000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00e+00  
                                                                            
 FLAG_DOCUMENT_11   FLAG_DOCUMENT_12  FLAG_DOCUMENT_13   FLAG_DOCUMENT_14  
 Min.   :0.000000   Min.   :0.0e+00   Min.   :0.000000   Min.   :0.000000  
 1st Qu.:0.000000   1st Qu.:0.0e+00   1st Qu.:0.000000   1st Qu.:0.000000  
 Median :0.000000   Median :0.0e+00   Median :0.000000   Median :0.000000  
 Mean   :0.003912   Mean   :6.5e-06   Mean   :0.003525   Mean   :0.002936  
 3rd Qu.:0.000000   3rd Qu.:0.0e+00   3rd Qu.:0.000000   3rd Qu.:0.000000  
 Max.   :1.000000   Max.   :1.0e+00   Max.   :1.000000   Max.   :1.000000  
                                                                           
 FLAG_DOCUMENT_15  FLAG_DOCUMENT_16   FLAG_DOCUMENT_17    FLAG_DOCUMENT_18 
 Min.   :0.00000   Min.   :0.000000   Min.   :0.0000000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000000   1st Qu.:0.00000  
 Median :0.00000   Median :0.000000   Median :0.0000000   Median :0.00000  
 Mean   :0.00121   Mean   :0.009928   Mean   :0.0002667   Mean   :0.00813  
 3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.0000000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.000000   Max.   :1.0000000   Max.   :1.00000  
                                                                           
 FLAG_DOCUMENT_19    FLAG_DOCUMENT_20    FLAG_DOCUMENT_21   
 Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000  
 1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.0000000  
 Median :0.0000000   Median :0.0000000   Median :0.0000000  
 Mean   :0.0005951   Mean   :0.0005073   Mean   :0.0003349  
 3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.0000000  
 Max.   :1.0000000   Max.   :1.0000000   Max.   :1.0000000  
                                                            
 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY
 Min.   :0.00               Min.   :0.00             
 1st Qu.:0.00               1st Qu.:0.00             
 Median :0.00               Median :0.00             
 Mean   :0.01               Mean   :0.01             
 3rd Qu.:0.00               3rd Qu.:0.00             
 Max.   :4.00               Max.   :9.00             
 NA's   :41519              NA's   :41519            
 AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT
 Min.   :0.00               Min.   : 0.00             Min.   :  0.00           
 1st Qu.:0.00               1st Qu.: 0.00             1st Qu.:  0.00           
 Median :0.00               Median : 0.00             Median :  0.00           
 Mean   :0.03               Mean   : 0.27             Mean   :  0.27           
 3rd Qu.:0.00               3rd Qu.: 0.00             3rd Qu.:  0.00           
 Max.   :8.00               Max.   :27.00             Max.   :261.00           
 NA's   :41519              NA's   :41519             NA's   :41519            
 AMT_REQ_CREDIT_BUREAU_YEAR
 Min.   : 0.0              
 1st Qu.: 0.0              
 Median : 1.0              
 Mean   : 1.9              
 3rd Qu.: 3.0              
 Max.   :25.0              
 NA's   :41519             

3. Distribution Plots for Key Numerical Variables

Visualizing the distribution of key numerical variables can help us detect skewness, outliers, and the need for transformations.

4. Categorical Variable Analysis

Visualizing the distribution of categorical variables helps us understand the frequency of different categories.

PCA Analysis

Let’s perform PCA (Principal Component Analysis) to identify the variance captured by the principal components and visualize it using scree plots and cumulative explained variance. This will help us assess whether we can reduce dimensionality by leveraging strong linear relationships in the data.

Steps:

  1. Data Preparation: We will preprocess the data, focusing on scaling numeric columns.
  2. PCA Computation: Perform PCA on the standardized numeric data.
  3. Scree Plot: Plot the explained variance of each principal component.
  4. Cumulative Explained Variance Plot: Visualize the cumulative variance explained to determine the number of components needed to capture most of the variance.

1. Data Preparation

First, let’s preprocess the data by selecting numeric columns and standardizing them.

2. PCA Computation

Now, let’s perform PCA on the scaled numeric data.

3. Scree Plot

We’ll plot the variance explained by each principal component.

4. Cumulative Explained Variance Plot

Next, we’ll visualize the cumulative explained variance to determine how many components explain a significant portion of the variance.
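The four steps above can be sketched in base R as follows. This is a minimal illustration on simulated data rather than the project's own code: the data frame `demo_data` and its columns are hypothetical stand-ins for the credit data, and `prcomp()` is assumed as the PCA routine.

```r
# Hypothetical stand-in for the credit data (column names are illustrative).
set.seed(359)
n <- 500
demo_data <- data.frame(
  income  = rlnorm(n, meanlog = 11, sdlog = 0.5),
  credit  = rlnorm(n, meanlog = 13, sdlog = 0.7),
  annuity = rlnorm(n, meanlog = 10, sdlog = 0.5),
  age     = runif(n, 21, 69),
  label   = sample(c("a", "b"), n, replace = TRUE)
)

# 1. Data preparation: keep numeric columns and drop incomplete rows.
numeric_data <- demo_data[, sapply(demo_data, is.numeric), drop = FALSE]
numeric_data <- numeric_data[complete.cases(numeric_data), , drop = FALSE]

# 2. PCA on standardized data (each column centered and scaled).
pca_fit <- prcomp(numeric_data, center = TRUE, scale. = TRUE)

# 3. Proportion of variance explained by each component (scree plot values).
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)

# 4. Cumulative variance and the number of components needed to reach 90%.
cum_var <- cumsum(var_explained)
n_components_90 <- which(cum_var >= 0.90)[1]

# Draw the two plots only in an interactive session.
if (interactive()) {
  plot(var_explained, type = "b", xlab = "Principal component",
       ylab = "Proportion of variance explained", main = "Scree plot")
  plot(cum_var, type = "b", xlab = "Number of components",
       ylab = "Cumulative variance explained")
  abline(h = 0.90, lty = 2)
}
```

The value `n_components_90` is the quantity read off the cumulative plot: the smallest number of components whose combined variance reaches the chosen threshold.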

Interpretation

  • Scree Plot: This plot will show how much variance each principal component explains. Look for the “elbow” point where adding more components yields diminishing returns in explained variance.

  • Cumulative Explained Variance Plot: This will help identify the number of principal components that capture a desired threshold (e.g., 90%) of the total variance.

Based on these plots, the columns in our data do not show strong linear relationships with one another: in the scree plot, the first principal component explains only about 20% of the variance, and capturing 90% of the variance would require roughly 50 principal components. PCA is therefore not a beneficial transformation here. The relationships in the data appear to be more complex than linear, and given the class imbalance it is crucial that we preserve as much information as possible.

UMAP (Uniform Manifold Approximation and Projection) Analysis

I attempted to use UMAP (Uniform Manifold Approximation and Projection) to explore whether it could cluster our credit_data more effectively than PCA, potentially capturing more complex relationships in a low-dimensional projection. UMAP is particularly useful when data has non-linear relationships that PCA might not capture due to its linear nature. By applying UMAP, I aimed to visualize the data in 2D and 3D spaces to check for any natural clusters that might emerge, especially given the imbalanced nature of our target variable.

Steps

  1. Data Preparation:
    • I performed one-hot encoding on the categorical variables and dropped columns with over 100,000 missing values. I also removed rows with missing data.
    • To deal with class imbalance, I downsampled the data to create a balanced subset for the analysis.
  2. UMAP Implementation:
    • I performed UMAP dimensionality reduction with specified parameters such as n_neighbors and min_dist to project the data into two and three dimensions.
    • I plotted the UMAP results using both 2D and 3D visualizations to observe how well the data clustered according to the target variable.
  3. Hyperparameter Tuning:
    • I explored different hyperparameter combinations for UMAP by looping through various values of n_neighbors, min_dist, and spread.
    • For each combination, I generated and saved plots to visualize how changes in these parameters affected the clustering.
  4. Comparison with PCA:
    • By using UMAP, I sought to explore potential non-linear relationships and more complex clustering structures that PCA might miss, especially in cases where linear assumptions do not hold strongly.
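The preparation and tuning steps above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code used for the report: the toy data frame, the `na_limit` value, and the grid values are all hypothetical, and the `uwot` package is assumed as the UMAP implementation (the report does not say which package was used). The UMAP loop only runs if `uwot` is installed.

```r
set.seed(359)

# Toy stand-in for credit_data (all column names here are hypothetical).
toy <- data.frame(
  TARGET = rbinom(300, 1, 0.2),
  x1     = rnorm(300),
  x2     = c(rep(NA, 200), rnorm(100)),  # mostly missing column
  g      = factor(sample(c("F", "M"), 300, replace = TRUE))
)

# 1a. Drop columns with too many NAs, then rows with any NA.
na_limit <- 100  # illustrative; the report uses 100,000 on the full data
toy <- toy[, colSums(is.na(toy)) <= na_limit, drop = FALSE]
toy <- toy[complete.cases(toy), , drop = FALSE]

# 1b. Downsample every class to the size of the smallest class.
n_min <- min(table(toy$TARGET))
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$TARGET), function(idx) {
  idx[sample.int(length(idx), n_min)]
}), use.names = FALSE)
balanced <- toy[keep, , drop = FALSE]

# 1c. One-hot encode the predictors; removing the intercept keeps all
#     levels of the first factor in the design matrix.
predictors <- balanced[, setdiff(names(balanced), "TARGET"), drop = FALSE]
X <- model.matrix(~ . - 1, data = predictors)

# 3. Hypothetical grid of UMAP hyperparameters to loop over.
param_grid <- expand.grid(
  n_neighbors = c(5, 15, 50),
  min_dist    = c(0.01, 0.1, 0.5),
  spread      = c(1, 2)
)

# 2. Run UMAP once per parameter combination (requires the uwot package).
embeddings <- list()
if (requireNamespace("uwot", quietly = TRUE)) {
  for (i in seq_len(nrow(param_grid))) {
    p <- param_grid[i, ]
    # Project to 2 dimensions; use n_components = 3 for the 3D variant.
    embeddings[[i]] <- uwot::umap(
      X,
      n_neighbors  = p$n_neighbors,
      min_dist     = p$min_dist,
      spread       = p$spread,
      n_components = 2
    )
  }
}
```

Each element of `embeddings` is a two-column matrix that can be plotted and colored by `balanced$TARGET` to check whether the classes separate.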

Here are some of the plots that were generated by this transformation:

Note: For more plots with different hyperparameters, see the plots folder in the repo.

UMAP is a powerful tool for visualizing high-dimensional data, and it complements PCA well when the data exhibits non-linear relationships that linear techniques cannot fully capture. Although neither transformation revealed clear underlying clusters in our data, both were a worthwhile part of the exploratory analysis. In future work, we could experiment with a PCA -> UMAP -> KNN pipeline or a UMAP -> KNN pipeline: while UMAP did not produce global clusters, some local clusters could be observed. A KNN classifier with a small k might pick up on these regional patterns.

Feature Selection

Based on our inspection, we will select specific categorical and numerical columns for our logistic regression, KNN, and Random Forest models. The chosen features are described below.

Categorical Features:

  • CODE_GENDER: Gender of the applicant (e.g., “M” for Male, “F” for Female).
  • NAME_CONTRACT_TYPE: Type of loan contract (e.g., “Cash loans,” “Revolving loans”).
  • FLAG_OWN_CAR: Indicates car ownership (“Y” for Yes, “N” for No).
  • FLAG_OWN_REALTY: Indicates real estate ownership (“Y” for Yes, “N” for No).
  • NAME_INCOME_TYPE: Type of income of the applicant (e.g., “Working,” “Commercial associate,” “Pensioner”).
  • NAME_EDUCATION_TYPE: Educational background of the applicant (e.g., “Higher education,” “Secondary education,” “Incomplete higher”).
  • NAME_FAMILY_STATUS: Family status of the applicant (e.g., “Married,” “Single / not married,” “Divorced”).
  • NAME_HOUSING_TYPE: Housing situation of the applicant (e.g., “House / apartment,” “With parents,” “Municipal apartment”).
  • WEEKDAY_APPR_PROCESS_START: Day of the week when the application process started (e.g., “Monday,” “Tuesday”).
  • REG_REGION_NOT_LIVE_REGION: Flag indicating whether the applicant’s registration region differs from the region where they live (1 for different, 0 for same).

Numerical Features:

  • AMT_ANNUITY: Loan annuity (the regular loan payment amount).
  • AMT_CREDIT: Total credit amount provided.
  • CNT_CHILDREN: Number of children or dependents.
  • AMT_INCOME_TOTAL: Total annual income of the applicant.
  • AMT_GOODS_PRICE: Price of the goods the loan is taken for.
  • DAYS_EMPLOYED: Number of days before the application that the applicant started their current employment (negative values represent days before the application date).
  • DAYS_REGISTRATION: Number of days since the applicant registered their residence (negative values represent days before the application date).
  • DAYS_BIRTH: Age of the applicant in days (negative values represent days before the application date).
  • AMT_REQ_CREDIT_BUREAU_HOUR: Number of credit bureau requests in the past hour.
  • AMT_REQ_CREDIT_BUREAU_DAY: Number of credit bureau requests in the past day.
  • AMT_REQ_CREDIT_BUREAU_WEEK: Number of credit bureau requests in the past week.
  • AMT_REQ_CREDIT_BUREAU_MON: Number of credit bureau requests in the past month.
  • AMT_REQ_CREDIT_BUREAU_QRT: Number of credit bureau requests in the past quarter.
  • AMT_REQ_CREDIT_BUREAU_YEAR: Number of credit bureau requests in the past year.
  • OBS_30_CNT_SOCIAL_CIRCLE: Number of observations of the applicant’s social circle with an observable default at 30 days past due.
  • DEF_30_CNT_SOCIAL_CIRCLE: Number of observations of the applicant’s social circle that defaulted at 30 days past due.
  • OBS_60_CNT_SOCIAL_CIRCLE: Number of observations of the applicant’s social circle with an observable default at 60 days past due.
  • DEF_60_CNT_SOCIAL_CIRCLE: Number of observations of the applicant’s social circle that defaulted at 60 days past due.
  • DAYS_LAST_PHONE_CHANGE: Number of days since the applicant last changed their phone number.

These features were chosen for their potential relevance to credit risk prediction; I believe they are key indicators to analyze before issuing a loan.

Selecting the relevant columns from the dataset to focus on key variables for analysis.

                    TARGET                CODE_GENDER 
                         0                          0 
        NAME_CONTRACT_TYPE               FLAG_OWN_CAR 
                         0                          0 
           FLAG_OWN_REALTY           NAME_INCOME_TYPE 
                         0                          0 
       NAME_EDUCATION_TYPE         NAME_FAMILY_STATUS 
                         0                          0 
         NAME_HOUSING_TYPE WEEKDAY_APPR_PROCESS_START 
                         0                          0 
REG_REGION_NOT_LIVE_REGION                AMT_ANNUITY 
                         0                         12 
                AMT_CREDIT               CNT_CHILDREN 
                         0                          0 
          AMT_INCOME_TOTAL            AMT_GOODS_PRICE 
                         0                        278 
             DAYS_EMPLOYED          DAYS_REGISTRATION 
                         0                          0 
                DAYS_BIRTH AMT_REQ_CREDIT_BUREAU_HOUR 
                         0                      41519 
 AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK 
                     41519                      41519 
 AMT_REQ_CREDIT_BUREAU_MON  AMT_REQ_CREDIT_BUREAU_QRT 
                     41519                      41519 
AMT_REQ_CREDIT_BUREAU_YEAR   OBS_30_CNT_SOCIAL_CIRCLE 
                     41519                       1021 
  DEF_30_CNT_SOCIAL_CIRCLE   OBS_60_CNT_SOCIAL_CIRCLE 
                      1021                       1021 
  DEF_60_CNT_SOCIAL_CIRCLE     DAYS_LAST_PHONE_CHANGE 
                      1021                          1 
[1] 264898     30

Assessing the amount of missing data in the selected columns and removing rows with any missing values to prepare the data for further analysis.

                    TARGET                CODE_GENDER 
                         0                          4 
        NAME_CONTRACT_TYPE               FLAG_OWN_CAR 
                         0                          0 
           FLAG_OWN_REALTY           NAME_INCOME_TYPE 
                         0                          0 
       NAME_EDUCATION_TYPE         NAME_FAMILY_STATUS 
                         0                          0 
         NAME_HOUSING_TYPE WEEKDAY_APPR_PROCESS_START 
                         0                          0 
REG_REGION_NOT_LIVE_REGION                AMT_ANNUITY 
                         0                          0 
                AMT_CREDIT               CNT_CHILDREN 
                         0                          0 
          AMT_INCOME_TOTAL            AMT_GOODS_PRICE 
                         0                          0 
             DAYS_EMPLOYED          DAYS_REGISTRATION 
                         0                          0 
                DAYS_BIRTH AMT_REQ_CREDIT_BUREAU_HOUR 
                         0                          0 
 AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK 
                         0                          0 
 AMT_REQ_CREDIT_BUREAU_MON  AMT_REQ_CREDIT_BUREAU_QRT 
                         0                          0 
AMT_REQ_CREDIT_BUREAU_YEAR   OBS_30_CNT_SOCIAL_CIRCLE 
                         0                          0 
  DEF_30_CNT_SOCIAL_CIRCLE   OBS_60_CNT_SOCIAL_CIRCLE 
                         0                          0 
  DEF_60_CNT_SOCIAL_CIRCLE     DAYS_LAST_PHONE_CHANGE 
                         0                          0 

Filtering the dataset to remove rows with specific unwanted values and examining the target variable’s distribution to understand the class imbalance.


     0      1 
244403  20489 
Percentage of observations in the minority class: 7.73 %

Visualizing Class Imbalance on Cleaned Dataset

To further understand the dataset, we visualize the distribution of the TARGET variable to illustrate any class imbalance. This plot shows the proportion of positive and negative cases in the cleaned dataset. Note that dropping the observations with NA values made the class imbalance slightly worse: the minority class share fell by 0.34 percentage points, from 8.07% in the original dataset to 7.73%. For future development, one alternative to dropping observations with NA values is to replace them with the median (for numeric features) or the mode (for categorical features) of that feature column.
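The median and mode replacement mentioned above was not part of this project, but a minimal base R sketch of the idea could look like this. The helper names `stat_mode` and `impute_simple` and the toy values are hypothetical.

```r
# Most frequent non-missing value of a vector (table() drops NA by default).
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Replace NAs with the median (numeric columns) or the mode (other columns).
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][missing] <- stat_mode(df[[col]])
    }
  }
  df
}

# Toy example with one numeric and one categorical column.
toy <- data.frame(
  AMT_ANNUITY = c(100, NA, 300, 500),
  CODE_GENDER = c("F", "M", NA, "F"),
  stringsAsFactors = FALSE
)
imputed <- impute_simple(toy)
```

Unlike dropping rows, this keeps every observation, so the class proportions of TARGET are unchanged. To avoid leakage, the medians and modes would normally be computed on the training set only.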

Data Sampling

Random Sampling for Balanced Dataset

To improve training efficiency and keep the dataset manageable, we randomly downsample the data to 30,000 observations. We first calculate the proportion of each class in the original dataset and then determine how many samples to draw from each class so that these proportions are maintained in the downsampled dataset. Setting a seed makes the sampling reproducible. The sampling function extracts the required number of samples for each class, and the results are combined into a single dataset. This downsampled dataset, which preserves the original class distribution rather than balancing the classes, is then saved for use in subsequent model training.
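A proportional (stratified) downsample of this kind can be sketched in base R as below. The function name `stratified_sample`, the seed, and the toy data are assumptions for illustration; the project's own sampling function is not shown in this report.

```r
# Draw about n_total rows while keeping each class's share of `target`.
stratified_sample <- function(df, target, n_total, seed = 359) {
  set.seed(seed)  # makes the draw reproducible
  props <- prop.table(table(df[[target]]))
  # Rounded per-class counts; rounding can shift the total by a row or two.
  n_per_class <- setNames(as.integer(round(props * n_total)), names(props))
  idx_by_class <- split(seq_len(nrow(df)), df[[target]])
  keep <- unlist(lapply(names(idx_by_class), function(cls) {
    idx <- idx_by_class[[cls]]
    idx[sample.int(length(idx), n_per_class[[cls]])]
  }))
  df[keep, , drop = FALSE]
}

# Toy data: 10,000 rows with an 8% minority class.
toy <- data.frame(
  TARGET = rep(c(0, 1), times = c(9200, 800)),
  x      = seq_len(10000)
)
small <- stratified_sample(toy, "TARGET", n_total = 3000)
minority_pct <- 100 * mean(small$TARGET == 1)
```

Because the per-class counts come from the class proportions, `minority_pct` matches the share in the input data.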

Visualizing Class Imbalance on Downsampled Dataset

To further understand the downsampled dataset, we visualize the distribution of the TARGET variable to illustrate any class imbalance and show that it resembles that of the original dataset. This plot will help us see the proportion of positive and negative cases in the Downsampled dataset.

Percentage of observations in the minority class: 8.07 %

As we can see, the target variable distribution of the downsampled data closely mimics that of the original data, because we sampled according to the original population’s class proportions. However, this does not resolve the large gap between the classes. This naturally leads us to consider upsampling the minority class to prevent the model from being overly influenced by the majority class.

Visualizing Feature Distributions

To better understand the distribution of our feature variables, we plot histograms. This helps us identify the distribution patterns and the need for any transformations.

We define three functions for plotting histograms. The plot_histogram_logs function applies a log transformation to better visualize variables with skewed distributions, making them easier to interpret. The plot_histogram function plots the distribution of a variable without transformation. The new plot_histogram_power function applies a power transformation (such as square root) to visualize variables that benefit from reducing skewness. These plots help us understand the distribution and skewness of each feature, which can inform our decisions on necessary data transformations or preprocessing steps for the modeling phase.
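The three helpers could be sketched with base R graphics as follows. The function names come from the text, but the bodies here are assumptions: the report's versions are not shown, and the use of `log1p(abs(x))` and `abs(x)^power` is a choice made in this sketch so that zeros and negative day counts do not produce `-Inf` or `NaN`.

```r
# Histogram of a column without any transformation.
plot_histogram <- function(df, var) {
  hist(df[[var]], main = var, xlab = var)
}

# Histogram after a log transformation (log1p of the absolute value).
plot_histogram_logs <- function(df, var) {
  hist(log1p(abs(df[[var]])), main = paste0("log1p(|", var, "|)"), xlab = var)
}

# Histogram after a power transformation; power = 0.5 is the square root.
plot_histogram_power <- function(df, var, power = 0.5) {
  hist(abs(df[[var]])^power, main = paste0("|", var, "|^", power), xlab = var)
}

# Toy data with one right-skewed and one negative-valued column.
set.seed(359)
toy <- data.frame(
  AMT_CREDIT = rlnorm(1000, meanlog = 13, sdlog = 0.7),
  DAYS_BIRTH = -round(runif(1000, 21, 69) * 365)
)

# Plot every numeric column when run interactively.
if (interactive()) {
  numeric_vars <- names(toy)[sapply(toy, is.numeric)]
  for (v in numeric_vars) plot_histogram(toy, v)
}
```

Each function returns the object produced by `hist()`, so bin counts and breaks can be inspected as well as plotted.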

Visualizing Numeric Variable Distributions

To gain insights into the distribution of numeric variables, we generate histograms for each variable. This helps us understand their distributions and decide if any transformations are needed.

Histograms of Numeric Variables

First, we plot histograms for all numeric variables without applying any transformations.

[Plot output: 21 histograms, one per numeric variable, untransformed]

These histograms provide a visual representation of the distribution of each numeric variable. They reveal the general shape of the data, including any skewness or extreme values.

Next, we apply a log transformation to these variables and plot the histograms again.

[Plot output: 21 histograms, one per numeric variable, log-transformed]

These histograms provide a visual representation of the distribution of each numeric variable after a log transformation. Some numerical features benefit greatly from this transformation, as discussed below.

Next, we apply a square root transformation to these variables and plot the histograms again.

[Plot output: 21 histograms, one per numeric variable, square-root-transformed]

Benefits of Log and Square Root Transformations

Log Transformation

The histograms with log transformation offer a clearer view of the data distribution, especially for variables with skewed distributions or extreme values. Variables like AMT_CREDIT and AMT_INCOME_TOTAL often exhibit a right-skewed distribution that can be better normalized using a log transformation. This makes the data more suitable for modeling, as many machine learning algorithms perform better with features that approximate a normal distribution.

Variables Transformed Using Log:

  • AMT_ANNUITY: Reduces the impact of high values and normalizes the distribution.
  • AMT_CREDIT: Addresses right skewness and helps in stabilizing variance.
  • AMT_INCOME_TOTAL: Reduces extreme values and helps in normalizing income distribution.
  • AMT_GOODS_PRICE: Normalizes high-value outliers and improves data distribution.
  • DAYS_EMPLOYED: Helps to reduce the impact of extreme values related to employment duration.

Square Root Transformation

Square root transformation is effective for variables with a distribution that exhibits moderate skewness, particularly when the values are non-negative and have a range of scales. It reduces the impact of large values and stabilizes variance, making the data more manageable for modeling.

Variables Transformed Using Square Root:

  • DAYS_REGISTRATION: Reduces skewness and normalizes the distribution of registration days.
  • DAYS_BIRTH: Addresses skewness and makes age-related data more suitable for modeling.

Note: CNT_CHILDREN, along with the remaining variables, does not benefit significantly from a log transformation; CNT_CHILDREN in particular is discrete with a relatively narrow range of values. These variables are therefore left unchanged in our transformation process.

By applying these transformations, we aim to improve the distribution of our features, making them more appropriate for machine learning models and improving overall model performance.
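One way to check the effect of these transformations numerically is to compare sample skewness before and after. The sketch below uses simulated right-skewed data as a stand-in for a variable such as AMT_CREDIT; the `skewness` helper and the simulation parameters are assumptions, not results from the credit data.

```r
# Sample skewness: third central moment over the cubed standard deviation.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated right-skewed amounts (log-normal), standing in for AMT_CREDIT.
set.seed(359)
amt_credit <- rlnorm(10000, meanlog = 13, sdlog = 0.7)

skew_table <- c(
  raw  = skewness(amt_credit),
  log  = skewness(log(amt_credit)),
  sqrt = skewness(sqrt(amt_credit))
)
print(round(skew_table, 2))
```

For data like this, the log transformation brings skewness close to zero, while the square root reduces it only partially. That fits the guidance above of using the square root for moderate skew and the log for stronger skew.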

Data Preparation and Balancing

Splitting the Data

To build and evaluate our predictive model, we first split the downsampled dataset into training and testing sets. We use an 80-20 split, ensuring that both sets maintain a representative distribution of the target variable.
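An 80-20 split that keeps the class proportions in both sets can be sketched in base R as follows. The function name `stratified_split`, the seed, and the toy data are hypothetical; tidymodels users would more commonly reach for a stratified split helper from the rsample package.

```r
# Split df into train/test, sampling `prop` of the rows within each class.
stratified_split <- function(df, target, prop = 0.8, seed = 359) {
  set.seed(seed)
  idx_by_class <- split(seq_len(nrow(df)), df[[target]])
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), round(prop * length(idx)))]
  }), use.names = FALSE)
  list(
    train = df[train_idx, , drop = FALSE],
    test  = df[-train_idx, , drop = FALSE]
  )
}

# Toy data with an 8% minority class.
toy <- data.frame(
  TARGET = rep(c(0, 1), times = c(920, 80)),
  x      = seq_len(1000)
)
parts <- stratified_split(toy, "TARGET")
```

Both `parts$train` and `parts$test` keep the 8% minority share of the toy data, and no row appears in both sets.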

Handling Class Imbalance

In our dataset, the TARGET variable is extremely imbalanced, meaning that the number of non-default cases far exceeds the number of default cases. This imbalance can lead to biased models that favor the majority class. To address this, we use the ROSE (Random Over-Sampling Examples) library to upsample the minority class, increasing its representation in the training data so that the model can learn about the minority class.


    0     1 
22061  7939 

Percentage of observations in the minority class: 26.46 %

By generating more examples of the minority class (default cases), we ensure that the model learns about both classes more effectively. This helps prevent the majority class from overwhelming the model’s learning process and improves the model’s ability to detect defaults.
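The report uses the ROSE library for this step. As a package-free illustration of the underlying idea, the sketch below duplicates randomly chosen minority rows until the minority class reaches a chosen share. The function name `oversample_minority`, the 25% target share, and the toy data are assumptions and do not reproduce the ROSE call used in the project.

```r
# Randomly duplicate minority rows until they make up `minority_share`.
oversample_minority <- function(df, target, minority_share = 0.25, seed = 359) {
  set.seed(seed)
  counts <- table(df[[target]])
  minority <- names(counts)[which.min(counts)]
  is_min <- as.character(df[[target]]) == minority
  n_major <- sum(!is_min)
  # Solve n_min_new / (n_min_new + n_major) = minority_share for n_min_new.
  n_min_new <- ceiling(minority_share * n_major / (1 - minority_share))
  extra <- max(0, n_min_new - sum(is_min))
  min_idx <- which(is_min)
  dup_idx <- min_idx[sample.int(length(min_idx), extra, replace = TRUE)]
  df[c(seq_len(nrow(df)), dup_idx), , drop = FALSE]
}

# Toy training set with an 8% minority class.
train_toy <- data.frame(
  TARGET = rep(c(0, 1), times = c(920, 80)),
  x      = seq_len(1000)
)
train_up <- oversample_minority(train_toy, "TARGET")
table(train_up$TARGET)
```

Oversampling should be applied to the training set only, so that the test set still reflects the real class distribution.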

Preprocessing Recipes and Model Workflows

In this section, we define and apply preprocessing recipes for our models. These recipes ensure that the data is properly prepared before training. We will cover logistic regression, k-Nearest Neighbors (KNN), and Random Forest (RF) models, and we will use ROC AUC as the evaluation metric.
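ROC AUC can be read as the probability that a randomly chosen default case receives a higher predicted score than a randomly chosen non-default case, which makes it less sensitive to class imbalance than plain accuracy. The base R sketch below computes it from ranks. The function name `roc_auc_rank` is hypothetical, and the modeling code in this project presumably relies on a package implementation instead.

```r
# ROC AUC via the rank-sum (Mann-Whitney) formulation.
# truth: 0/1 labels, score: predicted probability of class 1.
roc_auc_rank <- function(truth, score) {
  stopifnot(length(truth) == length(score), all(truth %in% c(0, 1)))
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  r <- rank(score)  # ties receive average ranks
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Small worked example: three of the four positive/negative pairs are
# ordered correctly, so the AUC is 3/4.
truth <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)
auc <- roc_auc_rank(truth, score)
```

A value of 0.5 corresponds to random ranking and 1 to perfect separation of the two classes.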

Logistic Regression

Recipe Definition

For the logistic regression model, we create a recipe that includes:

  • One-Hot Encoding: Converts categorical variables into a binary matrix.
  • Zero Variance Removal: Removes predictors with no variance.
  • Log Transformation: Applies a log transformation to skewed numerical variables to improve normality.
  • Centering and Scaling: Centers and scales numerical predictors to standardize them.
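A recipe with these four steps might be written with the recipes package roughly as shown below. This is a sketch on simulated data: the data frame `train_demo`, its `CONSTANT` column, and the choice of which column to log-transform are assumptions for illustration, and the step order simply follows the list above.

```r
library(recipes)

# Simulated training data; CONSTANT exists only to show zero-variance removal.
set.seed(359)
n <- 200
train_demo <- data.frame(
  TARGET       = factor(rbinom(n, 1, 0.3)),
  CODE_GENDER  = factor(sample(c("F", "M"), n, replace = TRUE)),
  AMT_CREDIT   = rlnorm(n, meanlog = 13, sdlog = 0.7),
  CNT_CHILDREN = rpois(n, 0.5),
  CONSTANT     = 1
)

logreg_recipe <- recipe(TARGET ~ ., data = train_demo) |>
  # One-hot encode categorical predictors (one column per level).
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  # Drop predictors that take a single value.
  step_zv(all_predictors()) |>
  # Log-transform a skewed, strictly positive amount.
  step_log(AMT_CREDIT) |>
  # Center and scale all numeric predictors, including the dummy columns.
  step_normalize(all_numeric_predictors())

# Estimate the steps on the training data and return the processed data.
baked <- bake(prep(logreg_recipe, training = train_demo), new_data = NULL)
str(baked)
```

Estimating the recipe with `prep()` on the training data and then applying it to new data with `bake()` means the test set is scaled with the training means and standard deviations.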

Apply the Recipe

Before training, we apply the recipes to ensure that preprocessing is correctly applied.

tibble [30,000 × 48] (S3: tbl_df/tbl/data.frame)
 $ REG_REGION_NOT_LIVE_REGION                       : num [1:30000] -0.12 -0.12 -0.12 -0.12 -0.12 ...
 $ AMT_ANNUITY                                      : num [1:30000] -0.0613 -1.4863 0.5548 -0.6555 0.5697 ...
 $ AMT_CREDIT                                       : num [1:30000] -0.1 -1.072 0.716 0.168 0.634 ...
 $ CNT_CHILDREN                                     : num [1:30000] -0.589 -0.589 0.796 -0.589 -0.589 ...
 $ AMT_INCOME_TOTAL                                 : num [1:30000] -1.07 0.812 0.812 -1.07 -0.612 ...
 $ AMT_GOODS_PRICE                                  : num [1:30000] 0.054 -0.917 0.712 0.054 0.631 ...
 $ DAYS_EMPLOYED                                    : num [1:30000] -0.00439 -0.28007 0.09285 1.94306 1.94306 ...
 $ DAYS_REGISTRATION                                : num [1:30000] 1.708 -1.341 0.937 1.617 1.605 ...
 $ DAYS_BIRTH                                       : num [1:30000] -0.548 0.621 0.899 1.356 1.296 ...
 $ AMT_REQ_CREDIT_BUREAU_HOUR                       : num [1:30000] -0.0731 -0.0731 -0.0731 -0.0731 -0.0731 ...
 $ AMT_REQ_CREDIT_BUREAU_DAY                        : num [1:30000] -0.0632 -0.0632 -0.0632 -0.0632 -0.0632 ...
 $ AMT_REQ_CREDIT_BUREAU_WEEK                       : num [1:30000] -0.166 -0.166 -0.166 -0.166 -0.166 ...
 $ AMT_REQ_CREDIT_BUREAU_MON                        : num [1:30000] -0.294 -0.294 -0.294 -0.294 -0.294 ...
 $ AMT_REQ_CREDIT_BUREAU_QRT                        : num [1:30000] -0.427 -0.427 2.854 1.214 1.214 ...
 $ AMT_REQ_CREDIT_BUREAU_YEAR                       : num [1:30000] -1.0164 0.5839 0.5839 -0.483 0.0505 ...
 $ OBS_30_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.18 -0.611 -0.18 1.546 0.252 ...
 $ DEF_30_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.325 -0.325 1.985 1.985 -0.325 ...
 $ OBS_60_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.174 -0.609 -0.174 1.565 0.261 ...
 $ DEF_60_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.278 -0.278 2.581 -0.278 -0.278 ...
 $ DAYS_LAST_PHONE_CHANGE                           : num [1:30000] -2.365 0.378 0.634 0.296 0.29 ...
 $ TARGET                                           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ CODE_GENDER_M                                    : num [1:30000] 0 1 0 0 0 0 0 1 0 0 ...
 $ NAME_CONTRACT_TYPE_Revolving.loans               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ FLAG_OWN_CAR_Y                                   : num [1:30000] 0 1 0 0 0 0 0 1 0 0 ...
 $ FLAG_OWN_REALTY_Y                                : num [1:30000] 0 0 1 1 1 1 1 1 1 0 ...
 $ NAME_INCOME_TYPE_Pensioner                       : num [1:30000] 0 0 0 1 1 1 0 0 1 1 ...
 $ NAME_INCOME_TYPE_State.servant                   : num [1:30000] 0 0 0 0 0 0 1 0 0 0 ...
 $ NAME_INCOME_TYPE_Student                         : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_INCOME_TYPE_Working                         : num [1:30000] 1 1 1 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Higher.education             : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Incomplete.higher            : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Lower.secondary              : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Secondary...secondary.special: num [1:30000] 1 1 1 1 0 1 1 1 1 1 ...
 $ NAME_FAMILY_STATUS_Married                       : num [1:30000] 1 0 0 0 0 1 0 1 1 1 ...
 $ NAME_FAMILY_STATUS_Separated                     : num [1:30000] 0 0 1 0 1 0 0 0 0 0 ...
 $ NAME_FAMILY_STATUS_Single...not.married          : num [1:30000] 0 1 0 1 0 0 1 0 0 0 ...
 $ NAME_FAMILY_STATUS_Widow                         : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_House...apartment              : num [1:30000] 0 1 1 1 1 1 0 1 1 1 ...
 $ NAME_HOUSING_TYPE_Municipal.apartment            : num [1:30000] 0 0 0 0 0 0 1 0 0 0 ...
 $ NAME_HOUSING_TYPE_Office.apartment               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_Rented.apartment               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_With.parents                   : num [1:30000] 1 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_MONDAY                : num [1:30000] 0 0 1 0 0 1 0 0 1 0 ...
 $ WEEKDAY_APPR_PROCESS_START_SATURDAY              : num [1:30000] 0 0 0 0 0 0 1 1 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_SUNDAY                : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_THURSDAY              : num [1:30000] 0 1 0 0 0 0 0 0 0 1 ...
 $ WEEKDAY_APPR_PROCESS_START_TUESDAY               : num [1:30000] 1 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_WEDNESDAY             : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
tibble [30,000 × 48] (S3: tbl_df/tbl/data.frame)
 $ REG_REGION_NOT_LIVE_REGION                       : num [1:30000] -0.12 -0.12 -0.12 -0.12 -0.12 ...
 $ AMT_ANNUITY                                      : num [1:30000] -0.299 -1.163 0.338 -0.741 0.357 ...
 $ AMT_CREDIT                                       : num [1:30000] -0.392 -0.947 0.485 -0.158 0.373 ...
 $ CNT_CHILDREN                                     : num [1:30000] -0.589 -0.589 0.796 -0.589 -0.589 ...
 $ AMT_INCOME_TOTAL                                 : num [1:30000] -0.8 0.526 0.526 -0.8 -0.579 ...
 $ AMT_GOODS_PRICE                                  : num [1:30000] -0.261 -0.866 0.465 -0.261 0.357 ...
 $ DAYS_EMPLOYED                                    : num [1:30000] -0.481 -0.467 -0.488 2.131 2.131 ...
 $ DAYS_REGISTRATION                                : num [1:30000] -2.134 1.187 -0.921 -1.979 -1.958 ...
 $ DAYS_BIRTH                                       : num [1:30000] 0.598 -0.581 -0.889 -1.417 -1.347 ...
 $ AMT_REQ_CREDIT_BUREAU_HOUR                       : num [1:30000] -0.0731 -0.0731 -0.0731 -0.0731 -0.0731 ...
 $ AMT_REQ_CREDIT_BUREAU_DAY                        : num [1:30000] -0.0632 -0.0632 -0.0632 -0.0632 -0.0632 ...
 $ AMT_REQ_CREDIT_BUREAU_WEEK                       : num [1:30000] -0.166 -0.166 -0.166 -0.166 -0.166 ...
 $ AMT_REQ_CREDIT_BUREAU_MON                        : num [1:30000] -0.294 -0.294 -0.294 -0.294 -0.294 ...
 $ AMT_REQ_CREDIT_BUREAU_QRT                        : num [1:30000] -0.427 -0.427 2.854 1.214 1.214 ...
 $ AMT_REQ_CREDIT_BUREAU_YEAR                       : num [1:30000] -1.0164 0.5839 0.5839 -0.483 0.0505 ...
 $ OBS_30_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.18 -0.611 -0.18 1.546 0.252 ...
 $ DEF_30_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.325 -0.325 1.985 1.985 -0.325 ...
 $ OBS_60_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.174 -0.609 -0.174 1.565 0.261 ...
 $ DEF_60_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.278 -0.278 2.581 -0.278 -0.278 ...
 $ DAYS_LAST_PHONE_CHANGE                           : num [1:30000] -2.365 0.378 0.634 0.296 0.29 ...
 $ TARGET                                           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ CODE_GENDER_M                                    : num [1:30000] 0 1 0 0 0 0 0 1 0 0 ...
 $ NAME_CONTRACT_TYPE_Revolving.loans               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ FLAG_OWN_CAR_Y                                   : num [1:30000] 0 1 0 0 0 0 0 1 0 0 ...
 $ FLAG_OWN_REALTY_Y                                : num [1:30000] 0 0 1 1 1 1 1 1 1 0 ...
 $ NAME_INCOME_TYPE_Pensioner                       : num [1:30000] 0 0 0 1 1 1 0 0 1 1 ...
 $ NAME_INCOME_TYPE_State.servant                   : num [1:30000] 0 0 0 0 0 0 1 0 0 0 ...
 $ NAME_INCOME_TYPE_Student                         : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_INCOME_TYPE_Working                         : num [1:30000] 1 1 1 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Higher.education             : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Incomplete.higher            : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Lower.secondary              : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Secondary...secondary.special: num [1:30000] 1 1 1 1 0 1 1 1 1 1 ...
 $ NAME_FAMILY_STATUS_Married                       : num [1:30000] 1 0 0 0 0 1 0 1 1 1 ...
 $ NAME_FAMILY_STATUS_Separated                     : num [1:30000] 0 0 1 0 1 0 0 0 0 0 ...
 $ NAME_FAMILY_STATUS_Single...not.married          : num [1:30000] 0 1 0 1 0 0 1 0 0 0 ...
 $ NAME_FAMILY_STATUS_Widow                         : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_House...apartment              : num [1:30000] 0 1 1 1 1 1 0 1 1 1 ...
 $ NAME_HOUSING_TYPE_Municipal.apartment            : num [1:30000] 0 0 0 0 0 0 1 0 0 0 ...
 $ NAME_HOUSING_TYPE_Office.apartment               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_Rented.apartment               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_With.parents                   : num [1:30000] 1 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_MONDAY                : num [1:30000] 0 0 1 0 0 1 0 0 1 0 ...
 $ WEEKDAY_APPR_PROCESS_START_SATURDAY              : num [1:30000] 0 0 0 0 0 0 1 1 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_SUNDAY                : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_THURSDAY              : num [1:30000] 0 1 0 0 0 0 0 0 0 1 ...
 $ WEEKDAY_APPR_PROCESS_START_TUESDAY               : num [1:30000] 1 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_WEDNESDAY             : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...

Model Workflow

We define and combine the logistic regression model with the preprocessing recipe into a workflow.

k-Nearest Neighbors (KNN)

Recipe Definition

For KNN, we create a recipe that includes:

  • One-Hot Encoding: Converts categorical variables into binary indicator columns.
  • Normalization: Normalizes numerical predictors to a standard range.
  • Centering and Scaling: Centers and scales numerical predictors to ensure they contribute equally to the distance calculations in KNN.
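To see why scaling matters for KNN, consider the small base-R sketch below. The three applicants and their values are made up for illustration. On the raw scale the credit amount dominates the Euclidean distance; after centering and scaling, the number of children contributes as well and the nearest neighbour changes.

```r
# Three hypothetical applicants: credit amount in currency units, children as a count.
X <- rbind(
  a = c(AMT_CREDIT = 500000, CNT_CHILDREN = 0),
  b = c(AMT_CREDIT = 505000, CNT_CHILDREN = 3),
  c = c(AMT_CREDIT = 508000, CNT_CHILDREN = 0)
)

# Raw distances: AMT_CREDIT dominates, so `b` looks closest to `a`.
raw <- as.matrix(dist(X))
raw["a", c("b", "c")]

# After centering and scaling each column, CNT_CHILDREN counts too,
# and `c` becomes the nearer neighbour of `a`.
scaled <- as.matrix(dist(scale(X)))
scaled["a", c("b", "c")]
```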

Apply the Recipe

Apply the KNN recipes to ensure all preprocessing steps are correctly implemented.

tibble [30,000 × 48] (S3: tbl_df/tbl/data.frame)
 $ REG_REGION_NOT_LIVE_REGION                       : num [1:30000] -0.122 -0.122 -0.122 -0.122 -0.122 ...
 $ AMT_ANNUITY                                      : num [1:30000] -1.505 0.555 0.57 -0.259 0.37 ...
 $ AMT_CREDIT                                       : num [1:30000] -1.078 0.738 0.656 0.222 -0.268 ...
 $ CNT_CHILDREN                                     : num [1:30000] -0.59 0.77 -0.59 -0.59 -0.59 ...
 $ AMT_INCOME_TOTAL                                 : num [1:30000] 0.827 0.827 -0.62 -1.578 0.827 ...
 $ AMT_GOODS_PRICE                                  : num [1:30000] -0.914 0.741 0.658 0.128 -0.281 ...
 $ DAYS_EMPLOYED                                    : num [1:30000] -0.237 0.142 2.021 2.021 -0.228 ...
 $ DAYS_REGISTRATION                                : num [1:30000] -1.316 0.966 1.635 1.306 -0.886 ...
 $ DAYS_BIRTH                                       : num [1:30000] 0.673 0.949 1.343 1.221 0.682 ...
 $ AMT_REQ_CREDIT_BUREAU_HOUR                       : num [1:30000] -0.0718 -0.0718 -0.0718 -0.0718 -0.0718 ...
 $ AMT_REQ_CREDIT_BUREAU_DAY                        : num [1:30000] -0.0698 -0.0698 -0.0698 -0.0698 -0.0698 ...
 $ AMT_REQ_CREDIT_BUREAU_WEEK                       : num [1:30000] -0.157 -0.157 -0.157 -0.157 -0.157 ...
 $ AMT_REQ_CREDIT_BUREAU_MON                        : num [1:30000] -0.295 -0.295 -0.295 -0.295 -0.295 ...
 $ AMT_REQ_CREDIT_BUREAU_QRT                        : num [1:30000] -0.42 2.87 1.22 -0.42 -0.42 ...
 $ AMT_REQ_CREDIT_BUREAU_YEAR                       : num [1:30000] 0.577 0.577 0.043 1.112 -0.491 ...
 $ OBS_30_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.619 -0.194 0.231 -0.619 1.082 ...
 $ DEF_30_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.338 1.888 -0.338 -0.338 1.888 ...
 $ OBS_60_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.617 -0.189 0.24 -0.617 1.098 ...
 $ DEF_60_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.291 2.451 -0.291 -0.291 2.451 ...
 $ DAYS_LAST_PHONE_CHANGE                           : num [1:30000] 0.35 0.608 0.261 -0.286 0.422 ...
 $ TARGET                                           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ CODE_GENDER_M                                    : num [1:30000] 1 0 0 0 0 1 0 0 1 1 ...
 $ NAME_CONTRACT_TYPE_Revolving.loans               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ FLAG_OWN_CAR_Y                                   : num [1:30000] 1 0 0 0 0 1 0 0 0 1 ...
 $ FLAG_OWN_REALTY_Y                                : num [1:30000] 0 1 1 1 1 1 1 0 0 1 ...
 $ NAME_INCOME_TYPE_Pensioner                       : num [1:30000] 0 0 1 1 0 0 1 1 0 0 ...
 $ NAME_INCOME_TYPE_State.servant                   : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
 $ NAME_INCOME_TYPE_Student                         : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_INCOME_TYPE_Working                         : num [1:30000] 1 1 0 0 0 0 0 0 1 1 ...
 $ NAME_EDUCATION_TYPE_Higher.education             : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Incomplete.higher            : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Lower.secondary              : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Secondary...secondary.special: num [1:30000] 1 1 0 1 1 1 1 1 1 1 ...
 $ NAME_FAMILY_STATUS_Married                       : num [1:30000] 0 0 0 1 0 1 1 1 0 1 ...
 $ NAME_FAMILY_STATUS_Separated                     : num [1:30000] 0 1 1 0 0 0 0 0 1 0 ...
 $ NAME_FAMILY_STATUS_Single...not.married          : num [1:30000] 1 0 0 0 1 0 0 0 0 0 ...
 $ NAME_FAMILY_STATUS_Widow                         : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_House...apartment              : num [1:30000] 1 1 1 1 0 1 1 1 1 1 ...
 $ NAME_HOUSING_TYPE_Municipal.apartment            : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_Office.apartment               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_Rented.apartment               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_With.parents                   : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_MONDAY                : num [1:30000] 0 1 0 1 0 0 1 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_SATURDAY              : num [1:30000] 0 0 0 0 1 1 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_SUNDAY                : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_THURSDAY              : num [1:30000] 1 0 0 0 0 0 0 1 0 1 ...
 $ WEEKDAY_APPR_PROCESS_START_TUESDAY               : num [1:30000] 0 0 0 0 0 0 0 0 1 0 ...
 $ WEEKDAY_APPR_PROCESS_START_WEDNESDAY             : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...
tibble [30,000 × 48] (S3: tbl_df/tbl/data.frame)
 $ REG_REGION_NOT_LIVE_REGION                       : num [1:30000] -0.122 -0.122 -0.122 -0.122 -0.122 ...
 $ AMT_ANNUITY                                      : num [1:30000] -1.19 0.348 0.367 -0.465 0.131 ...
 $ AMT_CREDIT                                       : num [1:30000] -0.9487 0.5193 0.4048 -0.0999 -0.5129 ...
 $ CNT_CHILDREN                                     : num [1:30000] -0.59 0.77 -0.59 -0.59 -0.59 ...
 $ AMT_INCOME_TOTAL                                 : num [1:30000] 0.547 0.547 -0.583 -0.999 0.547 ...
 $ AMT_GOODS_PRICE                                  : num [1:30000] -0.863 0.507 0.395 -0.191 -0.515 ...
 $ DAYS_EMPLOYED                                    : num [1:30000] -0.449 -0.47 2.222 2.222 -0.449 ...
 $ DAYS_REGISTRATION                                : num [1:30000] 1.17 -0.958 -2.005 -1.466 0.939 ...
 $ DAYS_BIRTH                                       : num [1:30000] -0.635 -0.942 -1.398 -1.254 -0.645 ...
 $ AMT_REQ_CREDIT_BUREAU_HOUR                       : num [1:30000] -0.0718 -0.0718 -0.0718 -0.0718 -0.0718 ...
 $ AMT_REQ_CREDIT_BUREAU_DAY                        : num [1:30000] -0.0698 -0.0698 -0.0698 -0.0698 -0.0698 ...
 $ AMT_REQ_CREDIT_BUREAU_WEEK                       : num [1:30000] -0.157 -0.157 -0.157 -0.157 -0.157 ...
 $ AMT_REQ_CREDIT_BUREAU_MON                        : num [1:30000] -0.295 -0.295 -0.295 -0.295 -0.295 ...
 $ AMT_REQ_CREDIT_BUREAU_QRT                        : num [1:30000] -0.42 2.87 1.22 -0.42 -0.42 ...
 $ AMT_REQ_CREDIT_BUREAU_YEAR                       : num [1:30000] 0.577 0.577 0.043 1.112 -0.491 ...
 $ OBS_30_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.619 -0.194 0.231 -0.619 1.082 ...
 $ DEF_30_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.338 1.888 -0.338 -0.338 1.888 ...
 $ OBS_60_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.617 -0.189 0.24 -0.617 1.098 ...
 $ DEF_60_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.291 2.451 -0.291 -0.291 2.451 ...
 $ DAYS_LAST_PHONE_CHANGE                           : num [1:30000] 0.35 0.608 0.261 -0.286 0.422 ...
 $ TARGET                                           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ CODE_GENDER_M                                    : num [1:30000] 1 0 0 0 0 1 0 0 1 1 ...
 $ NAME_CONTRACT_TYPE_Revolving.loans               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ FLAG_OWN_CAR_Y                                   : num [1:30000] 1 0 0 0 0 1 0 0 0 1 ...
 $ FLAG_OWN_REALTY_Y                                : num [1:30000] 0 1 1 1 1 1 1 0 0 1 ...
 $ NAME_INCOME_TYPE_Pensioner                       : num [1:30000] 0 0 1 1 0 0 1 1 0 0 ...
 $ NAME_INCOME_TYPE_State.servant                   : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
 $ NAME_INCOME_TYPE_Student                         : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_INCOME_TYPE_Working                         : num [1:30000] 1 1 0 0 0 0 0 0 1 1 ...
 $ NAME_EDUCATION_TYPE_Higher.education             : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Incomplete.higher            : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Lower.secondary              : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Secondary...secondary.special: num [1:30000] 1 1 0 1 1 1 1 1 1 1 ...
 $ NAME_FAMILY_STATUS_Married                       : num [1:30000] 0 0 0 1 0 1 1 1 0 1 ...
 $ NAME_FAMILY_STATUS_Separated                     : num [1:30000] 0 1 1 0 0 0 0 0 1 0 ...
 $ NAME_FAMILY_STATUS_Single...not.married          : num [1:30000] 1 0 0 0 1 0 0 0 0 0 ...
 $ NAME_FAMILY_STATUS_Widow                         : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_House...apartment              : num [1:30000] 1 1 1 1 0 1 1 1 1 1 ...
 $ NAME_HOUSING_TYPE_Municipal.apartment            : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_Office.apartment               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_Rented.apartment               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_With.parents                   : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_MONDAY                : num [1:30000] 0 1 0 1 0 0 1 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_SATURDAY              : num [1:30000] 0 0 0 0 1 1 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_SUNDAY                : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_THURSDAY              : num [1:30000] 1 0 0 0 0 0 0 1 0 1 ...
 $ WEEKDAY_APPR_PROCESS_START_TUESDAY               : num [1:30000] 0 0 0 0 0 0 0 0 1 0 ...
 $ WEEKDAY_APPR_PROCESS_START_WEDNESDAY             : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...

Model Workflow

We define the KNN model, set up a parameter grid for tuning, and combine it with the preprocessing recipe.

Random Forest (RF)

Recipe Definition

For RF, we use a similar recipe to logistic regression with log transformations, centering, and scaling of numerical predictors.

tibble [30,000 × 48] (S3: tbl_df/tbl/data.frame)
 $ REG_REGION_NOT_LIVE_REGION                       : num [1:30000] -0.122 -0.122 -0.122 -0.122 -0.122 ...
 $ AMT_ANNUITY                                      : num [1:30000] -1.505 0.555 0.57 -0.259 0.37 ...
 $ AMT_CREDIT                                       : num [1:30000] -1.078 0.738 0.656 0.222 -0.268 ...
 $ CNT_CHILDREN                                     : num [1:30000] -0.59 0.77 -0.59 -0.59 -0.59 ...
 $ AMT_INCOME_TOTAL                                 : num [1:30000] 0.827 0.827 -0.62 -1.578 0.827 ...
 $ AMT_GOODS_PRICE                                  : num [1:30000] -0.914 0.741 0.658 0.128 -0.281 ...
 $ DAYS_EMPLOYED                                    : num [1:30000] -0.237 0.142 2.021 2.021 -0.228 ...
 $ DAYS_REGISTRATION                                : num [1:30000] -1.316 0.966 1.635 1.306 -0.886 ...
 $ DAYS_BIRTH                                       : num [1:30000] 0.673 0.949 1.343 1.221 0.682 ...
 $ AMT_REQ_CREDIT_BUREAU_HOUR                       : num [1:30000] -0.0718 -0.0718 -0.0718 -0.0718 -0.0718 ...
 $ AMT_REQ_CREDIT_BUREAU_DAY                        : num [1:30000] -0.0698 -0.0698 -0.0698 -0.0698 -0.0698 ...
 $ AMT_REQ_CREDIT_BUREAU_WEEK                       : num [1:30000] -0.157 -0.157 -0.157 -0.157 -0.157 ...
 $ AMT_REQ_CREDIT_BUREAU_MON                        : num [1:30000] -0.295 -0.295 -0.295 -0.295 -0.295 ...
 $ AMT_REQ_CREDIT_BUREAU_QRT                        : num [1:30000] -0.42 2.87 1.22 -0.42 -0.42 ...
 $ AMT_REQ_CREDIT_BUREAU_YEAR                       : num [1:30000] 0.577 0.577 0.043 1.112 -0.491 ...
 $ OBS_30_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.619 -0.194 0.231 -0.619 1.082 ...
 $ DEF_30_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.338 1.888 -0.338 -0.338 1.888 ...
 $ OBS_60_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.617 -0.189 0.24 -0.617 1.098 ...
 $ DEF_60_CNT_SOCIAL_CIRCLE                         : num [1:30000] -0.291 2.451 -0.291 -0.291 2.451 ...
 $ DAYS_LAST_PHONE_CHANGE                           : num [1:30000] 0.35 0.608 0.261 -0.286 0.422 ...
 $ TARGET                                           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ CODE_GENDER_M                                    : num [1:30000] 1 0 0 0 0 1 0 0 1 1 ...
 $ NAME_CONTRACT_TYPE_Revolving.loans               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ FLAG_OWN_CAR_Y                                   : num [1:30000] 1 0 0 0 0 1 0 0 0 1 ...
 $ FLAG_OWN_REALTY_Y                                : num [1:30000] 0 1 1 1 1 1 1 0 0 1 ...
 $ NAME_INCOME_TYPE_Pensioner                       : num [1:30000] 0 0 1 1 0 0 1 1 0 0 ...
 $ NAME_INCOME_TYPE_State.servant                   : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
 $ NAME_INCOME_TYPE_Student                         : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_INCOME_TYPE_Working                         : num [1:30000] 1 1 0 0 0 0 0 0 1 1 ...
 $ NAME_EDUCATION_TYPE_Higher.education             : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Incomplete.higher            : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Lower.secondary              : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_EDUCATION_TYPE_Secondary...secondary.special: num [1:30000] 1 1 0 1 1 1 1 1 1 1 ...
 $ NAME_FAMILY_STATUS_Married                       : num [1:30000] 0 0 0 1 0 1 1 1 0 1 ...
 $ NAME_FAMILY_STATUS_Separated                     : num [1:30000] 0 1 1 0 0 0 0 0 1 0 ...
 $ NAME_FAMILY_STATUS_Single...not.married          : num [1:30000] 1 0 0 0 1 0 0 0 0 0 ...
 $ NAME_FAMILY_STATUS_Widow                         : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_House...apartment              : num [1:30000] 1 1 1 1 0 1 1 1 1 1 ...
 $ NAME_HOUSING_TYPE_Municipal.apartment            : num [1:30000] 0 0 0 0 1 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_Office.apartment               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_Rented.apartment               : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ NAME_HOUSING_TYPE_With.parents                   : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_MONDAY                : num [1:30000] 0 1 0 1 0 0 1 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_SATURDAY              : num [1:30000] 0 0 0 0 1 1 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_SUNDAY                : num [1:30000] 0 0 0 0 0 0 0 0 0 0 ...
 $ WEEKDAY_APPR_PROCESS_START_THURSDAY              : num [1:30000] 1 0 0 0 0 0 0 1 0 1 ...
 $ WEEKDAY_APPR_PROCESS_START_TUESDAY               : num [1:30000] 0 0 0 0 0 0 0 0 1 0 ...
 $ WEEKDAY_APPR_PROCESS_START_WEDNESDAY             : num [1:30000] 0 0 1 0 0 0 0 0 0 0 ...

Model Workflow

We define the RF model, set up a parameter grid for tuning, and combine it with the preprocessing recipe.

Fine-Tuning and Training Models

We’ll proceed by fine-tuning and training our models: logistic regression, k-Nearest Neighbors (KNN), and random forest (RF). For each model, we’ll use cross-validation, perform a grid search for hyperparameter tuning (where applicable), and save the results. Finally, we’ll analyze the performance metrics for each model, including confusion matrices, to assess their effectiveness on the test data.
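For readers unfamiliar with cross-validation, the base-R sketch below shows the basic mechanics for a logistic regression: fit on all folds but one, score the held-out fold, and summarize the fold scores by their mean and standard error. The toy data, the number of folds (`k = 5`), and the 0.5 cutoff are assumptions; the report's own resampling setup is not shown.

```r
# Hypothetical toy data: one predictor, binary outcome.
set.seed(359)
n <- 1000
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + toy$x))

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment

# Accuracy on each held-out fold.
fold_acc <- vapply(seq_len(k), function(i) {
  fit  <- glm(y ~ x, data = toy[fold != i, ], family = binomial())
  prob <- predict(fit, newdata = toy[fold == i, ], type = "response")
  mean((prob >= 0.5) == (toy$y[fold == i] == 1))
}, numeric(1))

c(mean = mean(fold_acc), std_err = sd(fold_acc) / sqrt(k))
```

A grid search repeats this loop once per candidate hyperparameter value and keeps the value with the best average score.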

Logistic Regression, KNN, and Random Forest

Let's begin fine-tuning!

Logistic Regression Cross-Validation Results

Model        Metric    Mean       Standard Error
Transformed  accuracy  0.7403001  0.0011883
Transformed  roc_auc   0.6592071  0.0037212
Regular      accuracy  0.7405333  0.0009099
Regular      roc_auc   0.6610435  0.0033060

K-Nearest Neighbors Performance Metrics

Model              Metric    Estimate
Transformed        accuracy  0.8556667
Transformed        kap       0.0194780
No Transformation  accuracy  0.8575000
No Transformation  kap       0.0291659

Random Forest Performance Metrics

Model        Metric    Estimate
Transformed  accuracy  0.9195
Transformed  kap       0.0000

Analyzing Model Performance

After training the models, we analyze their performance using the metrics collected during cross-validation and the predictions on the test dataset. We determine the best model by weighing the trade-offs between false positives and false negatives together with each model's ROC AUC.

False Positives (FP)

  • Definition: False positives occur when the model predicts that a borrower will default on their loan (i.e., TARGET = 1), but in reality, the borrower does not default.
  • Impact:
    • Financial Cost: The model incorrectly identifies a credit-worthy borrower as a risk, which could lead to unnecessary denial of credit. This might result in lost opportunities for the lender and potential revenue.
    • Customer Experience: Borrowers who are incorrectly labeled as high-risk might experience frustration or inconvenience if they are denied credit or face higher interest rates.

False Negatives (FN)

  • Definition: False negatives occur when the model predicts that a borrower will not default on their loan (i.e., TARGET = 0), but in reality, the borrower does default.
  • Impact:
    • Financial Risk: The model fails to identify a high-risk borrower, potentially leading to financial losses due to defaults that could have been anticipated and mitigated.
    • Risk Management: The lender might face higher-than-expected default rates, which can affect profitability and increase the need for more stringent risk management strategies.

Balancing False Positives and False Negatives

In credit risk prediction, it’s crucial to balance false positives and false negatives:

  • Minimizing False Positives: Reducing false positives helps in approving more credit-worthy applicants. However, if reduced too much, it might lead to increased false negatives.

  • Minimizing False Negatives: Reducing false negatives ensures that potential defaults are caught early, but if too aggressive, it might result in higher false positives.
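The trade-off can be made concrete by sweeping the classification threshold. In the base-R sketch below, the labels and predicted default probabilities are simulated (they are not the report's predictions), and the thresholds are arbitrary. Raising the threshold can only lower the number of false positives and raise the number of false negatives.

```r
set.seed(359)
n <- 2000
truth <- rbinom(n, 1, 0.1)
# Hypothetical predicted default probabilities, only weakly informative.
score <- plogis(-2.2 + 1.2 * truth + rnorm(n))

# Count both error types when predicting "default" for score >= threshold.
count_errors <- function(threshold) {
  pred <- as.integer(score >= threshold)
  c(threshold = threshold,
    FP = sum(pred == 1 & truth == 0),
    FN = sum(pred == 0 & truth == 1))
}

errors <- t(sapply(c(0.1, 0.3, 0.5, 0.7), count_errors))
errors
```

Which row is preferable depends on the relative cost a lender assigns to a denied good applicant versus a missed default.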

Let's analyze the confusion matrix of each model after it is fit to the training data. This is not common practice, and much of this computation could have been avoided by first screening the models on their ROC AUC and accuracy metrics. In this case, however, I believe it was important to get the full picture, because high accuracy may not mean a successful model. For example, a model that always guesses the majority class is guaranteed an accuracy above 90% on representative samples of our population data. Analyzing the confusion matrices beforehand therefore lets us make a better-informed decision when ultimately declaring a winning model.
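The point about majority-class guessing can be checked with a few lines of base R. The sketch below uses hypothetical counts (9,195 non-defaults and 805 defaults, chosen only to mimic a default rate of roughly 8%) and a "model" that always predicts class 0. Accuracy looks excellent, while Cohen's kappa, which corrects for agreement expected from the class frequencies alone, is exactly zero.

```r
# Hypothetical labels with about 8% defaults, and an always-0 prediction.
truth <- factor(rep(c(0, 1), times = c(9195, 805)), levels = c(0, 1))
pred  <- factor(rep(0, length(truth)), levels = c(0, 1))

cm <- table(Predicted = pred, Actual = truth)
cm

accuracy <- sum(diag(cm)) / sum(cm)

# Cohen's kappa: agreement beyond what the marginal frequencies alone would give.
expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa    <- (accuracy - expected) / (1 - expected)

c(accuracy = accuracy, kappa = kappa)
```

An accuracy near 0.92 paired with a kappa of 0 is therefore a sign that a classifier may be predicting only the majority class.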

Final Model Evaluation

After experimenting with multiple models, we determined that the K-Nearest Neighbors (KNN) model without transformations surprisingly outperformed the other models, Logistic Regression and Random Forest. While KNN’s success might seem counterintuitive given its sensitivity to unscaled data and outliers, it performed best in terms of the ROC AUC metric. However, despite being the best-performing model of the group, there are several important considerations regarding its actual effectiveness.

1. ROC AUC Performance and Interpretation:

  • The ROC AUC for the KNN model was surprisingly low, and the ROC curve appeared below the y = x line. This is significant because the y = x line represents a random classifier (i.e., a model with no discriminative power, where the TPR equals the FPR). When the ROC curve falls below this line, it suggests that the model is performing worse than random guessing.
  • Why this happened: This outcome indicates that the model might be systematically predicting the opposite class or is heavily skewed by imbalanced data, leading to poor generalization on the test data. Despite the model achieving some level of performance during training and cross-validation, it struggles to differentiate between the positive and negative classes in real-world (testing) scenarios.
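One further possibility worth ruling out is a labelling mix-up rather than a genuinely bad model: if the predicted probability of one class is scored against the other class as the "event", the AUC comes out as one minus its true value and the curve sits below the diagonal. In `yardstick`, for instance, the `event_level` argument controls which factor level is treated as the event, and it defaults to the first level. The base-R sketch below illustrates the effect with simulated data and a hand-written rank-based AUC helper (both hypothetical); whether this applies to the report's KNN result cannot be determined from the output shown.

```r
# Rank-based AUC: probability that a random event case outranks a random
# non-event case (ties counted as one half).
auc <- function(score, is_event) {
  r  <- rank(score)
  n1 <- sum(is_event)
  n0 <- sum(!is_event)
  (sum(r[is_event]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(359)
truth  <- factor(rbinom(500, 1, 0.3), levels = c(0, 1))
# Hypothetical predicted probability of class "1" (default).
prob_1 <- plogis(-1 + 1.5 * (truth == "1") + rnorm(500))

auc_correct <- auc(prob_1, truth == "1")  # class "1" treated as the event
auc_flipped <- auc(prob_1, truth == "0")  # class "0" treated as the event

c(correct = auc_correct, flipped = auc_flipped)
```

The two values always sum to one, so an AUC well below 0.5 is often a hint to check which class the metric treats as positive.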

2. Confusion Matrix Analysis:

  • The confusion matrix for the KNN model reveals that the model has difficulty identifying the positive (defaulting) class. The true positives (correctly predicted defaults) are low, while the false negatives (missed defaults) are high. This is also the case for the other models; the difference is that KNN and Logistic Regression, unlike Random Forest, did not collapse to predicting only the non-defaulting class.
  • False Positives and False Negatives: In this context, the high number of false negatives is particularly concerning because it means the model is failing to identify borrowers who will actually default on their loans. This could result in significant financial risks if deployed in a real-world setting.

3. Baseline Model Comparison:

  • It’s essential to compare the KNN model to a baseline or null model. A baseline model could simply predict the majority class (e.g., predicting all borrowers as non-defaulting). If our KNN model’s performance is not substantially better than this baseline, the effort involved in building and tuning the model may not be justified.
  • Does the effort pay off?: Given the low ROC AUC and the model’s difficulty in identifying the defaulting class, the predictive power of the KNN model may not be worth the complexity and effort invested. In this case, a simpler approach such as a rule-based system, or a machine learning model that penalizes errors on the minority class two or three times as heavily (with regularization to keep it from being overwhelmed by the class imbalance), might be more appropriate, since it would place greater emphasis on predicting the minority class correctly.
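A cost-sensitive model of the kind suggested above can be sketched with case weights in base R's `glm()`. The simulated data and the weight of 3 for default cases are assumptions for illustration, not something the report fitted. The sketch compares recall on the default class with and without weights at a 0.5 cutoff.

```r
# Hypothetical imbalanced data: one predictor, rare positive class.
set.seed(359)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-3 + 1.0 * x))
toy <- data.frame(x = x, y = y)

# Hypothetical cost ratio: a missed default counts three times as much.
w <- ifelse(toy$y == 1, 3, 1)

fit_plain    <- glm(y ~ x, data = toy, family = binomial())
fit_weighted <- glm(y ~ x, data = toy, family = binomial(), weights = w)

# Share of actual defaults that the model flags at a 0.5 cutoff.
recall <- function(fit) {
  pred <- predict(fit, type = "response") >= 0.5
  sum(pred & toy$y == 1) / sum(toy$y == 1)
}

c(plain = recall(fit_plain), weighted = recall(fit_weighted))
```

The weighted fit trades some false positives for fewer missed defaults, so the weight should reflect the lender's actual cost ratio.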

4. Challenges with Imbalanced Data:

  • Even though we applied techniques like upsampling to close the gap between the minority and majority classes, the Random Forest model (and to some extent, the other models) was still overwhelmed by the majority class. This underscores the challenge of dealing with imbalanced data, where traditional machine learning algorithms can struggle to learn meaningful patterns from the minority class.
  • Nonlinearities and Complexity: Random Forest, being a more complex model that handles nonlinearities well, should theoretically perform better in capturing intricate relationships. However, the imbalance in the dataset, and perhaps overfitting to the majority class, might have hindered its performance. This suggests that more advanced techniques and more sophisticated resampling strategies might be necessary.

5. ROC Curve and Performance Visualization:

  • Plotting the ROC curves for all models shows that the KNN model is not the only one that performs poorly: the curves for the other models also struggle to stay above the y = x line. This indicates that all models have difficulty distinguishing between the positive and negative classes, with little improvement over random guessing.

Conclusion

  • Best of the Models, but Not Enough: The KNN model without transformations performed best in our testing, but it still struggles to identify the defaulting class accurately. The low ROC AUC and the confusion matrix results indicate that this model may not be reliable enough for real-world deployment.

  • Key Learnings: The imbalance in the dataset, model complexity, and the challenges of proper resampling all contributed to the difficulties faced by our models. Future work should consider more advanced resampling techniques, class-weighted models, or other approaches tailored to handling imbalanced data effectively.